# Understanding Reverse ETL integrations

## How it works[](#how-it-works)

Our data warehouse and database “reverse ETL” (Extract, Transform, Load) integrations extract data from a data warehouse so you can take advantage of your data in Customer.io and other services downstream. These integrations help you leverage your big data storage in Customer.io. For example, you might write a query representing `track` events so you can sync records from Snowflake to Mixpanel.

When you set up a reverse ETL integration, you’ll set up one or more *syncs*. This is the kind of [API call](/integrations/data-in/source-spec/incoming-data/) you want to transform your data into (like `track`, `identify`, etc) and the query that returns the data you want to use in Customer.io. Each row returned from the query represents an API method.

You’ll have to create a *sync* for each kind of method or data you want to use in Customer.io or other integrations. We run your query on an interval, and expose a `last_sync_time` value that you can use to make sure that you only sync data that changed since the previous sync interval.

## Add a reverse ETL or database integration[](#add-a-reverse-etl-source)

Before you add an integration, we suggest that you set up a service account or create a user account specifically for Customer.io with read-only access to the tables you want to sync. This helps you ensure the security of your data, and gives you a way to revoke access if you need to.

**Your database or storage bucket must allow connections from the following IP addresses**. If our IP addresses are blocked, we won’t be able to connect to your database.

Account region

IP Addresses

US

34.29.50.4, 35.222.130.209

EU

34.22.168.136, 34.78.194.61

1.  Go to *Data & Integrations > Integrations* and make sure you’re in the **Directory** tab.
2.  Find and select your *Data Warehouse* or *Database*. Make sure it says *Data In*; we offer some data-out integrations for these same data warehouses and databases!
3.  Give your database a name and connect your database to Customer.io
    
    [![connect your database](https://docs.customer.io/images/cdp-source-mysql.png)](#a4afe0c2ee9eb7045f276b9efd13706c-lightbox)
    
4.  Set up a **Sync**. A sync is the type of data (`identify`, `track`, etc) you want to import from your database and click **Next: Define Query**. You can set up syncs for each type of data you want to import.
    1.  Provide a *Name* and *Description* for the sync. This helps you understand the sync at a glance when you look at your integration’s *Overview* later.
    2.  Select the [type of data](/integrations/data-in/source-events) you want to import.
    3.  Set the *Sync Frequency*, indicating how often you want to query your database for new data. You should set the frequency such that sync operations don’t overlap. [Learn more about sync frequency](#sync-frequency).
    4.  Select when you want to start the sync: whether you want to begin importing data immediately, or schedule the sync to start at a later date.
        
        [![add a sync and determine the data you want to capture](https://docs.customer.io/images/cdp-source-etl-add-sync.png)](#171c1a0259c4c9ff8e960d5e24493331-lightbox)
        
5.  Enter the query that selects the data you want to import. Click **Run Query** to preview results and make sure that your query selects the right information.
    
    [![set up your query to sync the right information](https://docs.customer.io/images/cdp-etl-source-define-query.png)](#22e13889f0723ebb6fc2b6d36be3bdfa-lightbox)
    
6.  Click **Enable** to enable your sync.

Now you can connect your database to places outside of Customer.io.

### SSH Tunneling[](#ssh-tunneling)

For added security, we support SSH tunnelling for our [MySQL](/integrations/data-in/connections/reverse-etl/mysql), [PostgreSQL](/integrations/data-in/connections/reverse-etl/postgresql), and [Microsoft SQL Server](/integrations/data-in/connections/reverse-etl/microsoft-sql) integrations.

To set up your tunnel, you’ll generate a public key when you add your database:

1.  Enable SSH tunneling.
2.  Provide the hostname and port of your SSH server.
3.  Click **Generate Public Key** to generate a public key for your SSH server.
4.  Copy the public key to your SSH server and click **Connect**.

[![connect your database](https://docs.customer.io/images/cdp-source-reverseetl-ssh.png)](#7a68e19a75952eba73fbbc3651b48ae1-lightbox)

#### Rotating SSH Keys[](#rotating-ssh-keys)

You can return to your integration and generate a new public key for your SSH server at any time.

1.  Go to *Data & Integrations > Integrations* and select your database in the **Connections** tab.
2.  Go to the **Databases** tab and click the settings icon and go to **Edit**.
3.  Click **Generate Public Key** to generate a new public key and copy the key to your SSH server to continue connecting to your database.

We won’t save or use the key until you click **Save Database**, and we won’t let you save your changes until your database connection is active with the new key.

[![Generate a new SSH key to rotate out your old public key](https://docs.customer.io/images/cdp-source-reverseetl-ssh-rotate.png)](#4bce06f01972864bd2d1cb5a34ac70e0-lightbox)

## Add additional sync models[](#add-additional-sync-models)

A *sync* defines the type of call you want to make to Customer.io. Your query that returns the data you want to capture on every interval. Each row returned from the query represents an individual API call.

After you [set up your integration](#add-a-reverse-etl-source), you can add subsequent syncs to extract and transform data into additional calls by going to the **Sync** tab for your integration and clicking **Add sync**.

[![add a sync](https://docs.customer.io/images/cdp-source-etl-add-sync.png)](#171c1a0259c4c9ff8e960d5e24493331-lightbox)

### Last sync time: sync only the records you want[](#last-sync-time-sync-only-the-records-you-want)

We expose a `last_sync_time` variable that you should use in your query to sync only the records that have changed since the last sync. This helps you avoid syncing the same records over and over again—which can cause each query to take longer, risk sending duplicate information into Customer.io or other places.

We strongly recommend that you index a column in your database representing the date-time each row was last-updated. When you write your query, you should add a `WHERE` clause comparing your “last updated” column to the `{{last_sync_time}}`.

The last sync time is a Unix timestamp representing the date-time when the previous sync started. The first time you sync, `{{last_sync_time}}` will be `0`, so you’ll sync all records. After that, you’ll only sync records that have changed since the last sync.

```mysql
SELECT id AS userId, email, first_name, created AS created_at
FROM my_people
WHERE UNIX_TIMESTAMP(last_updated) > {{last_sync_time}}
```

### Update existing profiles without creating new ones[](#update-existing-profiles-without-creating-new-ones)

By default, `identify` requests create new people in Customer.io for any `userId` that doesn’t already exist. If you want to update existing profiles without adding new ones, you can use the `_update` parameter in your query.

```sql
SELECT
  userId,
  email,
  someAttribute,
  true AS '_update'
FROM
  peopleTable
WHERE
  updatedAt > {{last_sync_time}}
```

### Reserved properties and traits[](#reserved-properties-and-traits)

We automatically treat values that aren’t reserved by our API calls as `traits` or event `properties` depending on the type of call you send, so you don’t have to force your query to return data exactly in the shape our API expects. For example, if you want to set the `email` trait, you can return `user.email AS email`.

**For `identify` or `group` data**, any property other than the following goes in the `traits` object: `userId`, `anonymousId`, `groupId`, `integrations`, `messageId`, `timestamp`, `context`.

**For `track`, `page`, and `screen` data**, any property other than the following goes in the `properties` object: `userId`, `anonymousId`, `event`, `integrations`, `messageId`, `timestamp`, `context`.

These fields correspond to the common fields that we reserve in each API call, and extend to children of these reserved fields. For example, `context.ip` is also reserved. [Learn more about common fields in our APIs](/integrations/data-in/source-spec/common-fields).

## Nested traits, properties, and relationship attributes[](#nested-traits-properties-and-relationship-attributes)

When you set up a sync, you’re essentially propagating the result of a `SELECT` query to the [Pipelines API](/integrations/api/cdp/). Some properties or attributes require you to nest data. Or you might simply want to nest properties based on how you expect them in Customer.io.

Some reverse ETL integrations support dot notation, like `SELECT product AS 'cart.product_name'`. But some, like BigQuery, don’t. You can support nested items in a platform-agnostic syntax using `JSON_OBJECT`. This lets you build an object from items in your query by alternating keys and values in order, e.g. `key, value, key, value`.

For example, if you relate someone to a company (a [custom objectAn object is a non-person entity that you can associate with one or more people—like a company, account, or online course.](/journeys/objects/)), and you want to set attributes for the relationship, you’d nest them inside a `traits.relationshipAttributes` object. But BigQuery doesn’t support dot notation, so here’s how you’d do that:

```sql
SELECT 
   id as groupId,
   name,
   json_object(
      'job_title', job_title,
      'primary_contact', primary_contact
   ) as relationshipAttributes
```

You may need to do the same for other objects, like device context, relationship attributes, and so on. See our [API documentation](/integrations/api/cdp/) for more information about the structure of our payloads and properties you might need to pass as objects.

### Sync Frequency[](#sync-frequency)

You can sync data as often as every minute. However, we recommend that you set your sync frequency such that sync operations don’t overlap. If you schedule syncs such that a sync operation is scheduled to start while the previous operation is still we’ll skip the next sync operation.

## Enable and disable syncs[](#enable-and-disable-syncs)

In your integration, you can go into the **Syncs** tab to enable and disable syncs. You can also delete syncs from this tab.

When you **disable** a sync, we stop running the associated query and sending the type of data to the places you’ve connected your integration to—including Customer.io. We don’t delete the sync, so you can re-enable it later.

When you **enable** a sync, we’ll resume the sync. The next sync interval will send all data that has changed since the last sync.

[![enable or disable a sync](https://docs.customer.io/images/cdp-source-sync-enable.png)](#d5fdde2243893842fdf2692a4f9b6f80-lightbox)

## Errors and sync history[](#errors-and-sync-history)

In the **Imports** tab for your integration, we display a row for each sync interval showing the number of successful and unsuccessful rows—where each row represents an individual operation (like an `identify` call or a `track` call). You can click a sync to see errors. Typically errors occur when a row is either missing data or contains data that doesn’t map to the appropriate source call type.

If a sync has errors, the row also includes a *Download* button, so you can download a list of errors.

[![View your sync history](https://docs.customer.io/images/cdp-etl-source-imports.png)](#566029362a83457604c7cbeb73483db6-lightbox)

You can also go to the **Syncs** tab to see the status of your syncs, how often they run, and their last run time. You might use this tab to see if your syncs complete on time or if you need to decrease the frequency of syncs.

The **Data In** tab also shows successful source calls. You can use this to see exactly how your rows map to source calls.

[![View your sync history](https://docs.customer.io/images/cdp-etl-source-data-in.png)](#4ac434fb13e40ecbb524e8336f1daa44-lightbox)

### Missing rows[](#missing-rows)

Missing rows are often due to the value you compare to [`last_sync_time`](#last-sync-time-sync-only-the-records-you-want).

```mysql
SELECT id AS userId, email, first_name, created AS created_at, message_id AS messageId
FROM my_people
WHERE UNIX_TIMESTAMP(last_updated) > {{last_sync_time}}
```

If you can, update `last_updated` based on when your syncs start. If you can’t do this, you’ll need to change your query. For example, you could query for results beginning 5 minutes before the previous sync to give yourself a buffer and make sure that your query doesn’t miss results you want to sync:

```mysql
WHERE UNIX_SECONDS(last_updated) >= {{last_sync_time}} - 300
```

If you query for results beginning some amount of time before the previous sync, you should also update your query to capture (or generate a) `messageId`. This is a string field that we use to prevent duplicate entries. We accept the first instance of any given event with a given `messageId` and ignore any duplicate events with the same `messageId`. If your “time-buffered query” includes events with the same `messageId` as an event in a previous sync, we’ll ignore the duplicates.

If you [backdate events](/integrations/data-in/importing-old-data/#advanced-backfilling-events), you’ll need to deduplicate them before you send them to Customer.io. We deduplicate the `messageId` for 12 hours after we receive the operation—not the timestamp on the event itself.

## Sync query limits[](#sync-query-limits)

Name

Details

Limit

Maximum query length

The maximum length allowed for any query.

131,072 characters

`userId` column name length

The maximum length allowed for the userId column name.

191 characters

timestamp column name length

The maximum length for the timestamp column name.

191 characters

Sync frequency

The shortest possible duration between syncs.

1 minutes

## Result limits[](#result-limits)

A sync cannot return more than 40 million records and 300 total columns.

Name

Details

Limit

Record count

The maximum number of records a single sync will process. Note: This is the number of records extracted from the warehouse not the limit for the number of records loaded to the integration (for example, new/update/deleted).

30 million records

Column count

The maximum number of columns a single sync will process.

512 columns

Column name length

The maximum length of a record column.

128 characters

Record JSON size

The maximum size for a record when converted to JSON (some of this limit is used by Customer.io).

512 KiB

Column JSON size

The maximum size of any single column value.

128 KiB