HTTP(s) Data Connector
The HTTP(s) Data Connector enables federated SQL query across supported file formats stored at an HTTP(s) endpoint. The connector supports dynamic query and data refresh through SQL-based filtering.
datasets:
- from: http://static_username@localhost:3001/report.csv
name: local_report
params:
http_password: ${env:MY_HTTP_PASS}
Examplesโ
Basic Exampleโ
datasets:
- from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
name: laion_freesound
Using Basic Authenticationโ
datasets:
- from: http://static_username@localhost:3001/report.csv
name: local_report
params:
http_password: ${env:MY_HTTP_PASS}
Using Custom Headersโ
Custom HTTP headers can be specified for authentication, API keys, or other requirements. Headers are treated as sensitive data and will not be logged.
datasets:
- from: https://api.example.com/data.csv
name: api_data
params:
http_headers: 'Authorization:Bearer ${secrets:api_token},Accept:application/json'
Headers can also be separated by semicolons:
datasets:
- from: https://api.example.com/data.csv
name: api_data
params:
http_headers: 'Authorization: Bearer ${secrets:api_token}; X-API-Key: ${secrets:api_key}'
Using OAuth2 Refresh-Token Authenticationโ
For JSON APIs protected by OAuth2, the connector can exchange a long-lived refresh token for short-lived access tokens and keep them fresh automatically (RFC 6749 ยง6). On startup Spice hits the configured token endpoint once, then stamps Authorization: Bearer <access_token> on every data request and refreshes the token in the background before it expires.
datasets:
- from: https://api.example.com
name: secure_data
params:
file_format: json
allowed_request_paths: '/v1/**'
auth_token_url: https://auth.example.com/oauth/token
http_auth_refresh_token: ${secrets:my_refresh_token}
http_auth_client_id: ${secrets:my_client_id}
http_auth_client_secret: ${secrets:my_client_secret}
auth_scopes: 'read:data offline_access'
The http_auth_refresh_token, http_auth_client_id, and http_auth_client_secret parameters can be loaded from any supported secret store (environment variables, Kubernetes Secrets, AWS Secrets Manager, HashiCorp Vault, the OS keychain, etc.) using the ${secrets:...} replacement syntax.
Applies to JSON API endpoints (e.g. file_format: json). Structured file formats (csv/parquet/etc.) go through the object-store listing path and are not affected by this setting โ use http_headers for those.
See OAuth2 Refresh-Token Authentication for the full parameter reference and behavior notes.
Configurationโ
fromโ
The from field specifies the HTTP(s) endpoint and can be configured in two ways:
-
Direct URL to a file: A complete URL pointing to a specific supported file.
from: https://example.com/data/report.csv -
Base domain/path: A base URL that will be combined with special metadata fields to construct the complete request.
from: https://api.example.com/v1
The connector supports templated URLs with query parameters that can be dynamically populated using refresh_sql filters and special metadata fields.
nameโ
The dataset name. This will be used as the table name within Spice.
Example:
datasets:
- from: http://static_username@localhost:3001/report.csv
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
The dataset name cannot be a reserved keyword.
paramsโ
The connector supports authentication, timeout, connection pooling, and retry configuration via params.
| Parameter Name | Description |
|---|---|
http_port | Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively. |
http_username | Optional. Username for HTTP basic authentication. Default: None. |
http_password | Optional. Password for HTTP basic authentication. Default: None. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_http_pass}. |
http_headers | Optional. Custom HTTP headers as a comma-separated list of key:value pairs. Example: Content-Type:application/json,Accept:application/json. Default: None. |
allowed_request_paths | Required for using request_path filters. Comma-separated list of allowed paths. Example: /api/users,/api/posts. Paths must start with / and cannot contain .. segments. |
request_query_filters | Optional. Set to enabled to enable request_query filters. Default: disabled. When disabled, query parameter filters will be rejected. |
request_body_filters | Optional. Set to enabled to enable request_body filters for POST requests. Default: disabled. When disabled, request body filters will be rejected. |
client_timeout | Optional. Maximum time to wait for a response from the HTTP server (in seconds). Default: 30. Applied to the entire request-response cycle. |
connect_timeout | Optional. Timeout for establishing HTTP(s) connections (in seconds). Default: 10. |
pool_max_idle_per_host | Optional. Maximum number of idle connections to keep alive per host. Default: 10. |
pool_idle_timeout | Optional. Timeout for idle connections in the pool (in seconds). Default: 90. |
max_retries | Optional. Maximum number of retries for failed HTTP requests. Default: 3. |
retry_backoff_method | Optional. Retry backoff strategy: fibonacci (default), linear, or exponential. |
retry_max_duration | Optional. Maximum total duration for all retries (e.g., 30s, 5m). If not set, retries continue up to max_retries. |
retry_jitter | Optional. Randomization factor for retry delays (0.0 to 1.0). Default: 0.3 (30% randomization). Set to 0 for no jitter. |
max_request_query_length | Optional. Maximum length in characters for request_query filter values. Default: 1024. Maximum: 4096. |
max_request_body_bytes | Optional. Maximum size in bytes for request_body filter values. Default: 16384 (16 KiB). Maximum: 65536 (64 KiB). |
health_probe | Optional. Custom health probe path for endpoint validation during initialization (e.g., /health, /api/status). The endpoint must return a 2xx status code to pass validation. If not set, a random path is used and any status (including 404) is accepted. Must start with /. |
auth_token_url | Optional. OAuth2 token endpoint URL (must be HTTPS; http://localhost and loopback IPs are allowed for local testing). When set together with http_auth_refresh_token, the connector exchanges the refresh token for short-lived access tokens and attaches Authorization: Bearer <token> to all data requests. Applies to JSON API endpoints only. See OAuth2 Refresh-Token Authentication. |
http_auth_refresh_token | Optional. OAuth2 refresh token exchanged against auth_token_url to obtain access tokens. Required when auth_token_url is set. Use a secret store, e.g. ${secrets:my_refresh_token}. |
http_auth_client_id | Optional. OAuth2 client_id presented to the token endpoint. Required for confidential clients; optional for public clients. Must be paired with http_auth_client_secret for confidential clients. |
http_auth_client_secret | Optional. OAuth2 client_secret presented to the token endpoint. Required when the client is confidential; must be set together with http_auth_client_id. Use a secret store, e.g. ${secrets:my_client_secret}. |
auth_scopes | Optional. Space-separated OAuth2 scopes to request when refreshing (e.g. read:data offline_access). Omit to inherit the scopes bound to the refresh token. |
auth_client_auth | Optional. How client credentials are sent to the token endpoint: basic (HTTP Basic header, default per RFC 6749 ยง2.3.1) or body (client_id/client_secret in the form body). Default: basic. |
Pagination Parametersโ
| Parameter Name | Description |
|---|---|
pagination | Optional. Pagination mode: auto (default) auto-detects Link headers, enabled explicitly enables pagination with configuration below, disabled turns off pagination. |
pagination_next_pointer | Optional. JSON pointer (RFC 6901) to the next page URL or cursor in the response body (e.g., /next, /pagination/cursor, /links/next). |
pagination_link_header | Optional. Whether to follow HTTP Link headers with rel="next" for pagination. Default: enabled. Set to disabled to ignore Link headers. |
pagination_token_param | Optional. When set, the value from pagination_next_pointer is treated as a cursor/token and passed as this query parameter name in subsequent requests. When not set, the value is treated as a full URL. |
pagination_data_pointer | Optional. JSON pointer (RFC 6901) to the data array in each page's response (e.g., /data, /results, /items). When set, only the array at this path is returned as data rows. |
pagination_max_pages | Optional. Maximum number of pages to fetch. Default: 100. |
pagination_data_map_to_array | Optional. When enabled, if the data at pagination_data_pointer (or the top-level response) is a JSON object/map, extracts its values as rows instead of treating it as a single row. Default: disabled. Requires pagination to be enabled. |
pagination_query_params | Optional. Query parameter template for client-driven pagination. Supports {offset}, {limit}, and {page} variables (e.g., offset={offset}&limit={limit}). Requires pagination_page_size. Mutually exclusive with pagination_next_pointer and pagination_token_param. |
pagination_page_size | Optional. Number of items per page for query-parameter pagination. Must be a positive integer. Expands {limit} in pagination_query_params and detects the last page (fewer results than page_size means done). Requires pagination_query_params. |
HTTP Response Headersโ
When querying HTTP(s) datasets, Spice respects standard HTTP caching headers in responses. The connector supports the following cache-related response headers:
Cache-Controlโ
The Cache-Control response header from the HTTP(s) endpoint is passed through to clients querying Spice. When the HTTP(s) server returns a Cache-Control header with the stale-while-revalidate directive, clients can use this value to determine appropriate caching behavior.
For example, if the HTTP(s) endpoint returns:
Cache-Control: max-age=10, stale-while-revalidate=10
Clients querying Spice will receive this header and can:
- Serve fresh data for 10 seconds after fetching.
- Between 10-20 seconds, serve stale data while fetching fresh data in the background.
- After 20 seconds, fetch fresh data before serving the next request.
The stale-while-revalidate behavior in Spice is controlled by the stale_while_revalidate_ttl parameter in the caching configuration. When stale_while_revalidate_ttl is set to 0 (default), stale data will not be served. When set to a non-zero value, Spice serves stale cache entries while revalidating in the background.
Advanced Featuresโ
The HTTP connector provides advanced capabilities for working with dynamic APIs and RESTful services, including built-in pagination and special metadata fields.
Paginationโ
The HTTP connector supports automatic pagination for REST APIs that return data across multiple pages. Pagination is configured via params and works transparently with acceleration (caching, append, and full refresh modes) โ each page is streamed as a separate batch without buffering entire result sets in memory.
Pagination Modesโ
There are three pagination modes:
URL mode โ The next page URL is extracted from the response body (via pagination_next_pointer) or from the HTTP Link header with rel="next".
datasets:
- from: https://api.example.com/v1/items
name: items
params:
pagination: enabled
pagination_next_pointer: /links/next
pagination_data_pointer: /data
pagination_max_pages: 50
Token mode โ A cursor/token is extracted from the response body (via pagination_next_pointer) and passed as a query parameter (specified by pagination_token_param) in subsequent requests.
datasets:
- from: https://api.example.com/v1/items
name: items
params:
pagination: enabled
pagination_next_pointer: /pagination/cursor
pagination_token_param: cursor
pagination_data_pointer: /results
Query-parameter mode โ The client drives pagination by expanding a template (pagination_query_params) with {offset}, {limit}, and {page} variables. Pagination stops when a page returns fewer rows than pagination_page_size.
datasets:
- from: https://api.example.com/v1/widgets
name: widgets
params:
pagination: enabled
pagination_query_params: "offset={offset}&limit={limit}"
pagination_page_size: "100"
pagination_max_pages: "50"
Map-to-Array Conversionโ
Some APIs return data as a JSON object/map (e.g., {"1": {...}, "2": {...}}) instead of an array. Set pagination_data_map_to_array: enabled to extract the map values as individual rows.
datasets:
- from: https://api.example.com/v1/records
name: records
params:
pagination: enabled
pagination_data_map_to_array: enabled
pagination_query_params: "offset={offset}&limit={limit}"
pagination_page_size: "100"
Auto Modeโ
By default, pagination is set to auto, which automatically follows HTTP Link headers with rel="next" if present in responses. Set pagination: disabled to turn off all pagination behavior, or pagination: enabled to explicitly configure pagination with the parameters above.
Validation Rulesโ
pagination_query_paramsrequirespagination_page_size(and vice versa)pagination_query_paramsis mutually exclusive withpagination_next_pointerandpagination_token_parampagination_query_paramsmust contain{offset}or{page}to ensure pages advancepagination_token_paramrequirespagination_next_pointerpagination_next_pointerandpagination_data_pointermust be valid JSON pointers starting with/
SSRF Protectionโ
When using URL mode, next-page URLs extracted from response bodies are validated to share the same origin as the base URL configured in from. Cross-origin redirects are rejected.
Special Metadata Fieldsโ
The HTTP connector supports special metadata fields that provide fine-grained control over HTTP requests. These fields can be included in your dataset schema to dynamically construct request URLs and payloads.
For security, these metadata fields require explicit configuration to prevent unauthorized access:
request_pathrequiresallowed_request_pathsto be configured with glob patternsrequest_queryrequiresrequest_query_filters: enabledrequest_bodyrequiresrequest_body_filters: enabled
| Field Name | Type | Description |
|---|---|---|
request_path | String | Specifies the URL path to append to the base URL from the from field. When using a base domain/path in from, request_path constructs the complete endpoint. Example: If from: https://api.example.com and request_path: /users/123, the request will be made to https://api.example.com/users/123. Requires allowed_request_paths parameter. |
request_query | String | Defines query parameters to append to the request URL. Formatted as a query string (e.g., key1=value1&key2=value2). These parameters are appended to the URL after any path specified in request_path. Requires request_query_filters: enabled. Maximum length: configurable via max_request_query_length (default: 1024 characters). |
request_body | String | Contains the request body for POST/PUT requests. Typically used with REST APIs that require a JSON or form-encoded payload. The content type should be specified using http_headers. Requires request_body_filters: enabled. Maximum size: configurable via max_request_body_bytes (default: 16 KiB). |
These metadata fields work in combination:
- If
fromspecifies a complete file URL, these fields are ignored - If
fromspecifies a base URL, these fields construct the full request dynamically request_pathis appended to the base URLrequest_queryis appended as query parametersrequest_bodyis sent as the request payload (requires appropriate HTTP method configuration)
Response Metadata Fieldsโ
In addition to request metadata, the HTTP connector includes response metadata fields in the dataset schema. These fields capture information about the HTTP response and are available in SQL queries.
| Field Name | Type | Description |
|---|---|---|
content | String | The response body content. |
response_status | UInt16 | The HTTP status code of the response (e.g., 200, 404, 500). |
response_headers | Map(String, String) | The HTTP response headers as key-value pairs. Each header name maps to its value. Available for inspection in queries, e.g., to check content-type or custom headers returned by the API. |
fetched_at | Timestamp (Nanosecond) | The timestamp when the data was fetched. Uses the HTTP Date response header when available, falling back to the current system time. |
Querying Response Metadataโ
-- Check the HTTP status of cached responses
SELECT request_path, response_status, fetched_at
FROM my_http_dataset;
-- Inspect response headers
SELECT request_path, response_headers
FROM my_http_dataset
WHERE request_path = '/api/data';
When using caching refresh mode, transient HTTP error responses (5xx server errors and 429 Too Many Requests) are automatically excluded from the cache. These responses are still returned to the querying client but are not persisted, preventing temporary failures from polluting cached data.
Endpoint Validationโ
The HTTP connector validates the configured endpoint during initialization to detect issues such as DNS errors, connection problems, or invalid URLs early in the startup process.
Default Validation Behaviorโ
By default, the connector performs a health check by requesting a randomly generated path (e.g., /__spice_health_check_abc123def456) that is expected to return a 404 status. Any HTTP response, including 404 Not Found, indicates that the endpoint is reachable and the dataset will initialize successfully.
This default behavior works for most HTTP endpoints but may not be suitable for APIs that:
- Return error responses for unknown paths without proper HTTP status codes
- Have strict path validation that rejects requests to non-existent endpoints
- Require authentication for all paths, including health check endpoints
Custom Health Probeโ
For endpoints that require a specific health check path, configure the health_probe parameter:
datasets:
- from: https://api.example.com/v1
name: api_data
params:
health_probe: /health
When a custom health probe is configured:
- The connector validates the endpoint by requesting the specified path
- The health probe endpoint must return a 2xx status code (200-299) for validation to succeed
- If the health probe returns a non-2xx status code, the dataset will fail to initialize with an error message
This provides more reliable validation for APIs with dedicated health check endpoints.
Example with Authenticationโ
datasets:
- from: https://api.example.com
name: authenticated_api
params:
http_headers: 'Authorization:Bearer ${secrets:api_token}'
health_probe: /api/status
In this configuration, the health probe request to /api/status will include the authentication header, ensuring that the validation succeeds for APIs that require authentication on all endpoints.
Health Probe Requirementsโ
The health_probe parameter has the following requirements:
- Must start with
/ - Cannot exceed 2048 characters in length
- The target endpoint must return a 2xx HTTP status code for validation to succeed
OAuth2 Refresh-Token Authenticationโ
The HTTP connector supports the OAuth2 refresh-token grant (RFC 6749 ยง6) for JSON APIs. Given a long-lived refresh token and a token endpoint, Spice will:
- Exchange the refresh token for an access token at dataset startup.
- Attach
Authorization: Bearer <access_token>to every data request. - Refresh the access token in the background, 60 seconds before it expires, for the lifetime of the process.
- Honor rotated refresh tokens โ when the token endpoint returns a new
refresh_token, Spice uses it for the next exchange.
This flow is designed for APIs where the operator already has a refresh token in hand (e.g. issued via a separate authorization-code or device-code flow). Spice does not perform an interactive authorization flow, nor does it retry data requests on 401 โ keeping the token continuously fresh in the background is the only recovery path.
Basic Configurationโ
datasets:
- from: https://api.example.com
name: secure_data
params:
file_format: json
allowed_request_paths: '/v1/**'
auth_token_url: https://auth.example.com/oauth/token
http_auth_refresh_token: ${secrets:my_refresh_token}
http_auth_client_id: ${secrets:my_client_id}
http_auth_client_secret: ${secrets:my_client_secret}
Parameter Referenceโ
| Parameter | Kind | Required | Description |
|---|---|---|---|
auth_token_url | runtime | yes (for OAuth) | OAuth2 token endpoint URL. Must be HTTPS; http://localhost, http://127.0.0.1, and http://[::1] are accepted for local testing. |
http_auth_refresh_token | component, secret | yes (for OAuth) | Long-lived refresh token. Exchanged on startup for the first access token. Can be loaded from any supported secret store via ${secrets:...}. |
http_auth_client_id | component, secret | confidential | client_id. Required for confidential clients, optional for public clients. When set together with http_auth_client_secret, both are sent to the token endpoint. |
http_auth_client_secret | component, secret | confidential | client_secret. Must be paired with http_auth_client_id. Can be loaded from any supported secret store via ${secrets:...}. |
auth_scopes | runtime | no | Space-separated OAuth2 scopes (e.g. read:data offline_access). Omit to inherit the scopes bound to the refresh token. |
auth_client_auth | runtime | no | How client credentials are sent to the token endpoint: basic (default, HTTP Basic header per RFC 6749 ยง2.3.1) or body (as client_id/client_secret form fields). |
Component/secret parameters carry the http_ prefix when set in a dataset (http_auth_refresh_token, http_auth_client_id, http_auth_client_secret). Runtime parameters do not (auth_token_url, auth_scopes, auth_client_auth). This follows the same convention as http_password vs client_timeout.
The refresh token and client secret should never be committed to source. Reference them from any supported secret store โ environment variables, Kubernetes Secrets, AWS Secrets Manager, HashiCorp Vault, or the OS keychain โ using the ${secrets:...} replacement syntax. For example, with Kubernetes Secrets enabled:
params:
auth_token_url: https://auth.example.com/oauth/token
http_auth_refresh_token: ${secrets:my_refresh_token}
http_auth_client_id: ${secrets:my_client_id}
http_auth_client_secret: ${secrets:my_client_secret}
Public Clients (No Client Secret)โ
For public clients the client_secret is omitted. If you still want to send a client_id for correlation, set http_auth_client_id without http_auth_client_secret:
params:
auth_token_url: https://auth.example.com/oauth/token
http_auth_refresh_token: ${secrets:my_refresh_token}
http_auth_client_id: ${secrets:my_public_client_id}
Sending Credentials in the Body Instead of Basic Authโ
Some token endpoints require client_id/client_secret in the form body rather than via the HTTP Basic header. Set auth_client_auth: body:
params:
auth_token_url: https://auth.example.com/oauth/token
http_auth_refresh_token: ${secrets:my_refresh_token}
http_auth_client_id: ${secrets:my_client_id}
http_auth_client_secret: ${secrets:my_client_secret}
auth_client_auth: body
Local Testingโ
The connector rejects http:// token URLs by default, but allows http://localhost, http://127.0.0.1, and http://[::1] so you can run a mock OAuth server for development:
params:
auth_token_url: http://localhost:8080/oauth/token
http_auth_refresh_token: local-dev-token
Error Behaviorโ
The connector classifies token-endpoint errors to make remediation easy:
- Configuration errors (fail-fast at dataset init, surfaces as
InvalidConfiguration):- Malformed or insecure
auth_token_url - Token endpoint returns
400,401, or403(typically an invalid refresh token, client credentials, or scope) - Token endpoint returns a non-
Bearertoken_type - Incomplete config (e.g.
http_auth_refresh_tokenwithoutauth_token_url, orhttp_auth_client_secretwithouthttp_auth_client_id) - Both OAuth2 auth and an
Authorizationheader inhttp_headersโ remove one
- Malformed or insecure
- Transient / connection errors (surfaces as
UnableToConnect, retried in the background):- Network / TLS failures
5xx,408, or429from the token endpoint- Parse failures on the token response
Error bodies returned by the token endpoint are truncated to 512 bytes and whitespace-collapsed before being surfaced in errors or logs, so hostile or misbehaving endpoints cannot force unbounded buffering or leak multi-line payloads into logs.
Limitationsโ
- JSON APIs only. Structured file formats (csv, parquet, etc.) go through the object-store listing path and are not authenticated by this feature. For those, use a static bearer via
http_headers. - No interactive auth flows. Only the refresh-token grant is supported. Obtain the initial refresh token out-of-band.
- No 401โrefresh-and-retry. Background refresh keeps the token fresh; if a data request 401s, it propagates to the caller.
- One authenticator per dataset. Configure either OAuth2 or an
Authorizationheader inhttp_headers, not both โ the connector rejects the combination at registration time.
Advanced Usageโ
Using Special Metadata Fields with Base URLโ
When using a base URL with special metadata fields, you can dynamically construct different API endpoints:
datasets:
- from: https://api.example.com/v1
name: api_requests
params:
http_headers: 'Content-Type:application/json'
allowed_request_paths: '/users,/data/upload,/api/**'
request_query_filters: enabled
request_body_filters: enabled
With the above configuration, you can query different endpoints by providing values for the special metadata fields:
-- Query a specific user endpoint
SELECT * FROM api_requests
WHERE request_path = '/users/123' AND request_query = 'include=profile,settings';
-- Make a POST request with a body
SELECT * FROM api_requests
WHERE request_path = '/data/upload' AND request_body = '{"name":"example","value":42}';
The connector will construct requests like:
https://api.example.com/v1/users/123?include=profile,settingshttps://api.example.com/v1/data/uploadwith the JSON body
Securing Paths with Glob Patternsโ
The allowed_request_paths parameter supports glob patterns to flexibly and securely match request paths. This provides a flexible way to configure path filtering without listing every possible endpoint.
Pattern Types:
-
Single wildcard (
*): Matches any characters within a single path segment- Example:
/shows/*matches/shows/123and/shows/breaking-bad - Does not match across path separators:
/shows/*does not match/shows/123/episodes
- Example:
-
**Recursive wildcard (
**)**: Matches any number of path segments- Example:
/api/**matches/api/users,/api/v1/users, and/api/v2/posts/123 - Use for flexible API version matching or deep hierarchies
- Example:
-
Character classes (
[...]): Matches one character from a set- Example:
/api/v[0-9]/*matches/api/v1/usersand/api/v2/posts - Example:
/api/v[1-3]/*matches/api/v1/users,/api/v2/posts, and/api/v3/data
- Example:
Examples:
datasets:
- from: https://api.tvmaze.com
name: tv_api
params:
# Match any show ID
allowed_request_paths: '/shows/*'
-- Matches because /shows/82 matches the pattern /shows/*
SELECT * FROM tv_api WHERE request_path = '/shows/82';
datasets:
- from: https://api.example.com
name: versioned_api
params:
# Match all endpoints under any API version
allowed_request_paths: '/api/**'
-- All of these match the pattern /api/**
SELECT * FROM versioned_api WHERE request_path = '/api/users';
SELECT * FROM versioned_api WHERE request_path = '/api/v1/users';
SELECT * FROM versioned_api WHERE request_path = '/api/v2/products/electronics';
datasets:
- from: https://api.example.com
name: specific_versions
params:
# Match only API versions 1-9
allowed_request_paths: '/api/v[0-9]/*'
-- Matches because /api/v1/users matches /api/v[0-9]/*
SELECT * FROM specific_versions WHERE request_path = '/api/v1/users';
-- Does NOT match because v10 has two digits
SELECT * FROM specific_versions WHERE request_path = '/api/v10/users';
Dynamic Filters with Metadata Fieldsโ
The special metadata fields can be combined with dynamic filters to create sophisticated data refresh patterns.
Dynamic API Queries with SQLโ
datasets:
- from: https://api.tvmaze.com
name: tv_shows
params:
http_headers: 'Accept:application/json'
allowed_request_paths: '/search/shows,/shows/*,/shows/*/episodes'
request_query_filters: enabled
Query specific API endpoints dynamically:
-- Search for shows by name
SELECT * FROM tv_shows
WHERE request_path = '/search/shows' AND request_query = 'q=game+of+thrones';
-- Get a specific show by ID (matches /shows/* pattern)
SELECT * FROM tv_shows
WHERE request_path = '/shows/82';
-- Get episodes for a show with filters (matches /shows/*/episodes pattern)
SELECT * FROM tv_shows
WHERE request_path = '/shows/82/episodes' AND request_query = 'season=1';
Incremental Loading with Metadata Fieldsโ
datasets:
- from: https://api.example.com
name: events
params:
allowed_request_paths: '/events,/events/*'
request_query_filters: enabled
acceleration:
enabled: true
refresh_mode: append
refresh_sql: |
SELECT * FROM events
WHERE request_path = '/events'
AND request_query = CONCAT('since=', (SELECT MAX(created_at) FROM events))
This configuration:
- Uses
request_pathto specify the/eventsendpoint - Dynamically constructs the
request_queryparameter using the latest timestamp from existing data - On each refresh, only fetches events created after the last refresh
Paginated Data Loadingโ
For APIs with standard pagination patterns, consider using the built-in pagination feature instead of manual refresh_sql pagination. Built-in pagination handles page traversal automatically with streaming execution.
datasets:
- from: https://api.example.com/v2
name: paginated_data
params:
http_headers: 'Content-Type:application/json'
allowed_request_paths: '/data'
request_query_filters: enabled
acceleration:
enabled: true
refresh_mode: append
refresh_sql: |
SELECT * FROM paginated_data
WHERE request_path = '/data'
AND request_query = CONCAT('page=',
COALESCE((SELECT MAX(page_number) FROM paginated_data) + 1, 1),
'&limit=100')
This incrementally loads pages of data by:
- Tracking the last loaded page number
- Constructing the next page query parameter
- Fetching 100 records per page
POST Request with Dynamic Bodyโ
datasets:
- from: https://api.example.com
name: search_results
params:
http_headers: 'Content-Type:application/json'
allowed_request_paths: '/search'
request_body_filters: enabled
acceleration:
enabled: true
refresh_mode: full
refresh_sql: |
SELECT * FROM search_results
WHERE request_path = '/search'
AND request_body = '{"query": {"match": {"status": "active"}}, "from": 0, "size": 1000}'
This example demonstrates:
- Using
_bodyto send a JSON payload for a POST request - Executing complex search queries against REST APIs
- Fetching results based on structured query syntax
Processing JSON Responsesโ
APIs often return JSON data that requires parsing to extract specific fields. Spice provides JSON functions to process and transform JSON responses directly in SQL queries.
Extracting Fields from JSONโ
datasets:
- from: https://api.tvmaze.com
name: tvmaze
params:
file_format: json
allowed_request_paths: '/shows/*'
Extract specific fields from JSON responses:
-- Extract the show name from a JSON response
SELECT json_get_str(content, 'name') as name
FROM tvmaze
WHERE request_path = '/shows/169';
Working with Nested JSONโ
APIs often return deeply nested JSON structures that require parsing to extract specific fields. Use chained JSON functions to navigate nested objects:
-- Extract nested fields from a show's network information
SELECT
json_get_str(content, 'name') as show_name,
json_get_str(json_get(content, 'network'), 'name') as network_name,
json_get_str(json_get(json_get(content, 'network'), 'country'), 'name') as country,
json_get_str(json_get(json_get(content, 'network'), 'country'), 'code') as country_code
FROM tvmaze
WHERE request_path = '/shows/82';
This demonstrates extracting nested objects step by step:
json_get(content, 'network')extracts the network objectjson_get_str(json_get(content, 'network'), 'name')gets the network name from the nested object- Multiple
json_getcalls can be chained to navigate deeper levels
Extracting Multiple Fieldsโ
-- Parse multiple fields from a TV show API response
SELECT
json_get_str(content, 'name') as show_name,
json_get_str(content, 'type') as show_type,
json_get_str(content, 'language') as language,
json_get_int(content, 'runtime') as runtime_minutes,
json_get_str(content, 'premiered') as premiere_date,
json_get_str(content, 'status') as status
FROM tvmaze
WHERE request_path = '/shows/169';
Processing JSON Arraysโ
-- Extract genres from a JSON array
SELECT
json_get_str(content, 'name') as show_name,
json_get_array(content, 'genres') as genres_array
FROM tvmaze
WHERE request_path = '/shows/82';
For more details on available JSON functions including json_get, json_get_str, json_get_int, json_get_bool, and others, refer to the JSON functions reference.
Refresh SQL with Dynamic Filtersโ
The HTTP connector supports dynamic URL construction through refresh_sql with templated query parameters. This enables incremental data loading by appending filter conditions from the SQL query to the HTTP request URL.
How It Worksโ
When refresh_sql is specified with filters, the connector extracts filter conditions and appends them as query parameters to the URL. This is particularly useful for APIs that support filtering via query parameters.
Time-Based Incremental Loadingโ
datasets:
- from: https://api.example.com/data.csv?start_time={start_time}&end_time={end_time}
name: incremental_data
acceleration:
enabled: true
refresh_mode: append
refresh_sql: |
SELECT * FROM incremental_data
WHERE timestamp > (SELECT MAX(timestamp) FROM incremental_data)
In this example:
- The
{start_time}and{end_time}placeholders in the URL are replaced with values extracted from theWHEREclause inrefresh_sql - Each refresh appends only new data since the last refresh
- The connector automatically maps SQL filter conditions to URL query parameters
Supported Filter Operationsโ
The dynamic filter feature supports the following SQL operations:
- Equality comparisons (
=) - Greater than (
>) - Less than (
<) - Greater than or equal (
>=) - Less than or equal (
<=) - Range queries with
BETWEEN
Notesโ
- URL parameters must match filter column names in the
refresh_sql - Only filters that can be pushed down to the HTTP source will be applied to the URL
- Complex filters may not be supported for URL templating
Limitationsโ
Security Constraintsโ
For security and to prevent unauthorized access, the HTTP connector enforces the following constraints on special metadata fields:
Request Path Limitationsโ
- Explicit Allow-List Required: The
request_pathfield cannot be used without configuringallowed_request_paths - Path Pattern Format: All patterns in
allowed_request_pathsmust:- Start with
/ - Not contain
..path traversal segments - Not exceed 2048 characters in length
- Start with
- Glob Pattern Matching: Query filters are matched against glob patterns in the
allowed_request_pathslist using:*matches a single path segment (e.g.,/shows/*matches/shows/123but not/shows/123/episodes)**matches multiple path segments recursively (e.g.,/api/**matches/api/v1/usersand/api/v2/posts/123)[...]character classes (e.g.,/api/v[0-9]/*matches/api/v1/usersbut not/api/v10/users)
- Empty Paths: Empty
request_pathfilters are rejected
Example error when allowed_request_paths is not configured:
request_path filters are disabled for this dataset. Configure allowed_request_paths to enable them.
Request Query Limitationsโ
- Explicit Enable Required: The
request_queryfield requiresrequest_query_filters: enabled - Length Limit: Query strings are limited to 1024 characters by default (configurable up to 4096 via
max_request_query_length) - Control Characters: Query strings cannot contain control characters
- Leading Question Mark: The connector automatically strips leading
?if present
Example error when query filters are not enabled:
request_query filters are disabled for this dataset. Enable request_query_filters to use them.
Request Body Limitationsโ
- Explicit Enable Required: The
request_bodyfield requiresrequest_body_filters: enabled - Size Limit: Request bodies are limited to 16 KiB (16,384 bytes) by default (configurable up to 64 KiB via
max_request_body_bytes) - POST Method: When a
request_bodyfilter is present, the HTTP method automatically changes to POST
Example error when body filters are not enabled:
request_body filters are disabled for this dataset. Enable request_body_filters to use them.
Configuration Requirementsโ
To use the special metadata fields (request_path, request_query, request_body), you must:
- For
request_path: Configureallowed_request_pathswith a comma-separated list of allowed path patterns (supports glob patterns) - For
request_query: Setrequest_query_filters: enabledin params - For
request_body: Setrequest_body_filters: enabledin params
Example minimal configuration for all three fields:
datasets:
- from: https://api.example.com
name: my_api
params:
allowed_request_paths: '/users,/posts,/comments,/api/**'
request_query_filters: enabled
request_body_filters: enabled
Performance Considerationsโ
- Connection Pooling: The connector maintains up to 10 idle connections per host by default
- Retry Overhead: With the default 3 retries and Fibonacci backoff, failed requests may take several seconds before returning an error
- Cache Behavior: HTTP responses are cached based on the combination of path, query, and body parameters
Secretsโ
Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. Additionally, learn how to use referenced secrets in component parameters by visiting the using referenced secrets guide.
