Datasets best practices | Tealium Insights

The following best practices help ensure efficient dataset performance and data quality in Tealium Insights.

Optimize dataset structure

Optimizing the structure of your dataset can significantly improve query performance and improve load times.

Use multiple smaller datasets

Instead of using one large dataset for a dashboard, break it down into multiple smaller datasets. Using smaller datasets reduces the data loaded at once, making queries faster and the dashboard more responsive.

Reduce the number of columns

When creating a dataset, all columns are loaded into SPICE (Super-fast Parallel In-memory Calculation Engine) by default. Including more columns than is necessary can slow down query performance, especially with large datasets.

For example, in a dataset with 20 million rows:

1 column takes approximately 2.63 minutes to load.
4 columns take approximately 4.61 minutes to load.
150 columns take approximately 19.49 minutes to load.

In the dataset view, include only the columns needed for analysis or dashboards, and remove any columns that aren’t needed. This reduces load times and improves responsiveness.

Use date columns to filter rows

For large datasets, using date columns to filter rows can significantly improve database performance.

For event data, filter by the event - time column.
For visits/visitors data, filter by the updated column.

Filtering by these date columns allows Insights to take advantage of sorting optimizations, making it faster to retrieve data.

Aggregate and process data using SQL

Use custom SQL

Instead of loading entire tables or schemas, use the custom SQL option to pre-aggregate or filter data before loading into SPICE. This reduces dataset size and improves performance.

For example, to count event types by day, use a SQL query like this:

SELECT 
    COUNT(*) AS count, 
    TRUNC("event - time") AS "period", 
    "event - udo - tealium_event" 
FROM ${db_schema}.events_view__all_events__all_events 
GROUP BY "period", "event - udo - tealium_event";

Adapt this approach for other aggregations or analyses you need to perform, as it reduces the data processed in SPICE.

Avoid using `SELECT *` in queries

Using SELECT * queries can slow down performance and make queries harder to read. Instead, select only the columns you need.

Avoid selecting all columns:

SELECT * FROM employees;

Select only the columns you need:

SELECT employee_id, first_name, last_name
FROM employees;

If you’re exploring a new table and need to see all columns, use a LIMIT clause to reduce the number of rows displayed:

SELECT * FROM employees
LIMIT 10;

Avoid subqueries

While subqueries offer flexible query organization, they can slow down performance. When possible, use joins or temporary tables instead.

Avoid using a subquery:

SELECT employee_id, first_name, last_name
FROM employees
WHERE department_id IN (SELECT id FROM departments WHERE name = 'Sales');

Use a join to query data from two tables:

SELECT e.employee_id, e.first_name, e.last_name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE d.name = 'Sales';

Configure incremental refreshes

To maintain dataset accuracy and reduce refresh times, configure an incremental refresh when creating a refresh schedule. Incremental refreshes are ideal for large, frequently updated datasets because it processes only new or updated data, unlike full refreshes, which reload the entire dataset. For best results, we recommend a 2-day window size, though you can adjust this based on your requirements.