Troubleshooting your FeatureHub install

FeatureHub is designed to be as robust as possible, but it can still run into issues, and this section documents a few of them. We will continue to add documentation here as we help users resolve issues.

NATS goes down completely

You should be running at least three NATS nodes. NATS generally advertises the other nodes in the cluster back to the client, and the nats.urls configuration option lets you specify a list of NATS servers, so an application that loses its connection to one server can reconnect to another and carry on.

Even so, NATS can still go down completely because of a network failure or resource starvation in your cluster.

The default NATS install does not enable JetStream, so updates published while NATS is unavailable are not redelivered and your cache (Dacha1 or Dacha2) can become outdated. If this happens, you need to refresh your caches.

Solution

  • Use more than one NATS instance, and use the official NATS installation (e.g. the Helm chart).

  • If your cluster suffers a lot of jitter, consider using JetStream to make message delivery more reliable.

  • Where possible, list all of the running NATS instances in nats.urls so that your clients can reach every one of them.
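When checking the last point, it can help to confirm that every server named in nats.urls is actually reachable from where your application runs. The following is a minimal sketch, not part of FeatureHub: it assumes nats.urls holds a comma-separated list of nats://host:port entries, and the function names are made up for illustration. It only attempts a plain TCP connection, so it says nothing about NATS-level health.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_NATS_PORT = 4222  # NATS' default client port


def parse_nats_urls(value):
    """Split a comma-separated nats.urls value into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # tolerate entries written without the nats:// scheme
        parts = urlsplit(entry if "://" in entry else "nats://" + entry)
        servers.append((parts.hostname, parts.port or DEFAULT_NATS_PORT))
    return servers


def check_servers(value, timeout=2.0):
    """Return {(host, port): bool} for a plain TCP connect to each server."""
    results = {}
    for host, port in parse_nats_urls(value):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:  # refused, timed out, or name not resolvable
            results[(host, port)] = False
    return results
```

Calling check_servers with the value from your own configuration returns a dictionary in which any entry marked False is a server your clients cannot fail over to.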

Caches are poisoned by some event

Various things can cause cache poisoning. They are normally related to a strain on resources or to some portion of the cluster going down, which is outside the architectural control of FeatureHub.

Solution Dacha1

In Dacha1:

  • System Administrators can trigger a full cache restore using the UI from 1.6.1 onwards. You can refresh applications, portfolios or the whole system.

  • If you are using an earlier version, you can force individual environments to refresh by manually associating a service account with them and then disassociating it. This association process causes both the attached environments and the service accounts to fully refresh.

  • Alternatively, you can scale your Dacha1 instances to 0 and back up again, but this will cause an outage.

Further, in Dacha1 (from 1.6.1 onwards) you can ask the cache what it contains using a simple API call.

  • 'GET /dacha1/cache' - this will give you all of the environments and service accounts that the server holds, along with their counts.

  • 'GET /dacha1/cache/apiKeys' - this will give you a list of all the API key combinations that this server is capable of serving.
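The two calls above can be scripted so that the results from several Dacha1 instances are easy to compare. This is a minimal sketch using only the Python standard library. The base URL you pass in is an assumption about your own deployment (for example a port-forward to one Dacha1 pod), the function names are made up for illustration, and the response body is returned as raw text because its exact shape is not described here.

```python
from urllib.request import urlopen

# the two paths documented above
CACHE_PATHS = ("/dacha1/cache", "/dacha1/cache/apiKeys")


def cache_urls(base_url):
    """Build the two cache inspection URLs for one Dacha1 instance."""
    return [base_url.rstrip("/") + path for path in CACHE_PATHS]


def dump_cache(base_url, timeout=5.0):
    """GET both endpoints and return {url: response body as text}."""
    bodies = {}
    for url in cache_urls(base_url):
        with urlopen(url, timeout=timeout) as response:
            bodies[url] = response.read().decode("utf-8")
    return bodies
```

Because each call reports what that particular server holds, run dump_cache against each instance in turn when you suspect that only some of them are out of date.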

Solution Dacha2

In Dacha2, the instances are completely isolated from each other and cache only what they have received or requested. When they start they are empty, and they fill only as new updates come through or as API keys are requested. If your Dacha2 caches are poisoned, get a list of the running instances and replace them one by one; this causes no outage. Dacha2 is the default for Helm installs as of the 4.x chart.

I am getting a lot of 404s on my X service!

There is no specific solution to this; it's important to understand first where the 404s are coming from. Not Found errors (404s) occurring on Edge normally come from script kiddies or other scrapers trying to get into your Edge servers. On MR, the API occasionally uses them to determine whether something is available, but these are fairly rare, so 404s there are also more likely to be unwanted activity.

Not Found errors on Dacha are the result of a request for an API key that doesn’t exist, and this is unusual - your clients are expected to be only requesting valid API keys.

To help diagnose what is going wrong, you need to turn on REST trace logging, which is normally turned off because it is quite noisy.

Solution

Assuming you are using the Helm charts, there is a section that looks something like this:

  extraCommonConfigFiles:
    - configMapSuffix: log4j2-xml
      fileName: log4j2.xml
      content: |-
        <Configuration packages="cd.connect.logging" monitorInterval="30" verbose="true">
           <Appenders>
             <Console name="STDOUT" target="SYSTEM_OUT">
               <ConnectJsonLayout/>
             </Console>
           </Appenders>

           <Loggers>
             <AsyncLogger name="io.featurehub" level="debug"/>
             <!--
             <AsyncLogger name="io.ebean.SQL" level="trace"/>
             <AsyncLogger name="io.ebean.TXN" level="trace"/>
             <AsyncLogger name="io.ebean.SUM" level="trace"/>
             <AsyncLogger name="io.ebean.DDL" level="trace"/>
             <AsyncLogger name="io.ebean.cache.QUERY" level="trace"/>
             <AsyncLogger name="io.ebean.cache.BEAN" level="trace"/>
             <AsyncLogger name="io.ebean.cache.COLL" level="trace"/>
             <AsyncLogger name="io.ebean.cache.NATKEY" level="trace"/>

             <AsyncLogger name="jersey-logging" level="trace"/>
             <AsyncLogger name="io.featurehub.db" level="trace"/>
             -->
             <AsyncLogger name="io.featurehub.edge.features" level="debug"/>

The section between <!-- and --> is a comment. It holds loggers that would turn on all kinds of detailed logging, but the one we are interested in for REST logging is this one:

<AsyncLogger name="jersey-logging" level="trace"/>

It needs to move outside the comments. Once you do that, all of the details (except auth tokens) will start flowing into your logs.
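After editing the file, it is easy to leave the logger inside the comment by mistake. The sketch below is one way to check a log4j2.xml before deploying it. It is an illustration rather than a FeatureHub tool: the function names and the two small sample documents are made up, and it relies on the fact that Python's default XML parser discards comments, so any AsyncLogger it finds is one that is actually active.

```python
import xml.etree.ElementTree as ET


def active_loggers(log4j2_xml):
    """Return {logger name: level} for AsyncLogger elements outside comments."""
    root = ET.fromstring(log4j2_xml)  # the default parser drops XML comments
    return {el.get("name"): el.get("level") for el in root.iter("AsyncLogger")}


def rest_trace_enabled(log4j2_xml):
    """True when the jersey-logging logger is active at trace level."""
    level = active_loggers(log4j2_xml).get("jersey-logging") or ""
    return level.lower() == "trace"


# Simplified sample documents, not the full chart configuration.
COMMENTED = """<Configuration>
  <Loggers>
    <AsyncLogger name="io.featurehub" level="debug"/>
    <!--
    <AsyncLogger name="jersey-logging" level="trace"/>
    -->
  </Loggers>
</Configuration>"""

# The same document with the comment markers removed.
ENABLED = COMMENTED.replace("<!--", "").replace("-->", "")

if __name__ == "__main__":
    print("commented:", rest_trace_enabled(COMMENTED))
    print("enabled:  ", rest_trace_enabled(ENABLED))
```

The check needs a complete, well-formed XML document, so pass it the full content of the log4j2.xml entry rather than an excerpt.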

I need to do X to the database

There are very few tables in FeatureHub, 20 belonging to FeatureHub itself. The Java code that represents them is here. There are some basic tables you would expect:

  • fh_portfolio - stores portfolios, and is essentially the top-level table, along with fh_person

  • fh_person - stores people. fh_person_group_link stores the link between people and groups.

Next we have those that belong to a portfolio:

  • fh_application - stores applications (the portfolio is the parent)

  • fh_group - stores portfolio groups

  • fh_service_account - stores portfolio service accounts

Further down we have:

  • fh_fv_version - which stores the history of the feature values as they change

Belonging to the application comes next:

  • fh_environment. Others such as fh_acl (for group/application permissions), fh_service_account_env (service

  • fh_app_feature - this stores the feature itself

  • fh_env_feature_strategy - this stores the feature value (feature intersection with environment)

  • fh_userstate - stores app customisations like which environments are showing

  • fh_webhook - stores the results of requested webhooks

Operational tables:

  • fh_login - keeps track of issued tokens. If you wanted to kick everyone off and make them log in again, for example, you would delete everything here.

And misc tables:

  • db_migration - to keep track of database migrations

  • fh_after_mig_job - tracks jobs that need to run once after a migration occurs because of data structural changes

  • fh_cache - obsolete

  • fh_organisation - just stores your Organisation name.

Exists in API but not currently used:

  • fh_app_strategy - these are shared strategies for an application and aren’t currently surfaced in the UI

  • fh_strat_for_feature - stores the values a shared strategy has against a feature in an environment

Example

If, for example, you wanted to move an application from one portfolio to another, you would need to carefully cut the ties of that application with its existing portfolio. That means cutting ties with the groups (removing the permissions of the groups you have created, and repointing the permission held by the superuser group, which can create features), as well as with any service accounts.

Here we will swap the application "app1" from portfolio "Second" to "First".

# find the IDs of the portfolios
MariaDB [featurehub]> select id, name from fh_portfolio;
+--------------------------------------+-----------------+
| id                                   | name            |
+--------------------------------------+-----------------+
| b5d3d5d9-6830-4176-86b1-a64d9c590c0f | First Portfolio |
| e4687958-5751-430f-9f9e-468583a62c76 | Fourth          |
| 621f36fc-6491-4b9e-9a70-47192ce3cb85 | Second          |
| 3cc2a2fd-5469-42ab-a8a6-7a462189ff92 | Third           |
+--------------------------------------+-----------------+
4 rows in set (0.001 sec)

# find the app in portfolio "Second"
MariaDB [featurehub]> select id, name from fh_application where fk_portfolio_id = '621f36fc-6491-4b9e-9a70-47192ce3cb85';
+--------------------------------------+------+
| id                                   | name |
+--------------------------------------+------+
| a0ddda31-2cf8-4c12-9a04-15674d731bd5 | app1 |
+--------------------------------------+------+
1 row in set (0.000 sec)

MariaDB [featurehub]> desc fh_acl;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| id             | varchar(40)  | NO   | PRI | NULL    |       |
| environment_id | varchar(40)  | YES  | MUL | NULL    |       |
| application_id | varchar(40)  | YES  | MUL | NULL    |       |
| group_id       | varchar(40)  | YES  | MUL | NULL    |       |
| roles          | varchar(255) | YES  |     | NULL    |       |
| version        | bigint(20)   | NO   |     | NULL    |       |
| when_updated   | datetime(6)  | NO   |     | NULL    |       |
| when_created   | datetime(6)  | NO   |     | NULL    |       |
+----------------+--------------+------+-----+---------+-------+
8 rows in set (0.001 sec)

# find the permission ACLs in the app. One is the superuser and attached to the superuser group for that portfolio, and the second is specifically a permission to an environment for that portfolio's group
MariaDB [featurehub]> select * from fh_acl where application_id = 'a0ddda31-2cf8-4c12-9a04-15674d731bd5' or environment_id in (select id from fh_environment where fk_app_id='a0ddda31-2cf8-4c12-9a04-15674d731bd5');
+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------+
| id                                   | environment_id                       | application_id                       | group_id                             | roles        |
+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------+
| 7c8d400c-2daf-4268-acf8-b59cc1252e0b | NULL                                 | a0ddda31-2cf8-4c12-9a04-15674d731bd5 | 7a789c02-8964-4ff3-8d6a-27656e5a4aed | FEATURE_EDIT |
| f5ce52eb-92e9-47ad-bcbf-15cc65841a25 | 98f0a596-3224-46f3-8800-ee98dbafbbd9 | NULL                                 | d0009d21-856e-47db-910b-c363d4fe1011 | READ         |
+--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------+
2 rows in set (0.000 sec)

MariaDB [featurehub]> desc fh_service_account_env;
+-----------------------+--------------+------+-----+---------+-------+
| Field                 | Type         | Null | Key | Default | Extra |
+-----------------------+--------------+------+-----+---------+-------+
| id                    | varchar(40)  | NO   | PRI | NULL    |       |
| fk_environment_id     | varchar(40)  | NO   | MUL | NULL    |       |
| permissions           | varchar(200) | YES  |     | NULL    |       |
| fk_service_account_id | varchar(40)  | NO   | MUL | NULL    |       |
| when_updated          | datetime(6)  | NO   |     | NULL    |       |
| when_created          | datetime(6)  | NO   |     | NULL    |       |
| version               | bigint(20)   | NO   |     | NULL    |       |
+-----------------------+--------------+------+-----+---------+-------+
7 rows in set (0.002 sec)

# find the service account connections to this app
MariaDB [featurehub]> select * from fh_service_account_env where fk_environment_id in (select id from fh_environment where fk_app_id='a0ddda31-2cf8-4c12-9a04-15674d731bd5');
+--------------------------------------+--------------------------------------+-------------------------------+--------------------------------------+
| id                                   | fk_environment_id                    | permissions                   | fk_service_account_id                |
+--------------------------------------+--------------------------------------+-------------------------------+--------------------------------------+
| 8e114d6d-442c-4d5a-b5b2-a01588bea843 | 98f0a596-3224-46f3-8800-ee98dbafbbd9 | READ,UNLOCK,LOCK,CHANGE_VALUE | 034b6682-4059-4c7d-982d-d01a5ca11a2e |
+--------------------------------------+--------------------------------------+-------------------------------+--------------------------------------+
1 row in set (0.001 sec)

# now delete those service account permissions
MariaDB [featurehub]> delete from fh_service_account_env where fk_environment_id in (select id from fh_environment where fk_app_id='a0ddda31-2cf8-4c12-9a04-15674d731bd5');
Query OK, 1 row affected (0.002 sec)

# get rid of the ACL that is specific to the environment as the group belongs to the portfolio
MariaDB [featurehub]> delete from fh_acl where id = 'f5ce52eb-92e9-47ad-bcbf-15cc65841a25';
Query OK, 1 row affected (0.002 sec)


# find the group for the superusers in the "First" portfolio
MariaDB [featurehub]> select id, group_name from fh_group where fk_portfolio_id = 'b5d3d5d9-6830-4176-86b1-a64d9c590c0f';
+--------------------------------------+--------------------------------+
| id                                   | group_name                     |
+--------------------------------------+--------------------------------+
| e4fb3ba6-a5f0-46f8-91a5-b317c9a26887 | First Portfolio Administrators |
+--------------------------------------+--------------------------------+
1 row in set (0.000 sec)

# confirm we have only the one permission left and it is the wrong group id
MariaDB [featurehub]> select * from fh_acl where application_id = 'a0ddda31-2cf8-4c12-9a04-15674d731bd5' or environment_id in (select id from fh_environment where fk_app_id='a0ddda31-2cf8-4c12-9a04-15674d731bd5');
+--------------------------------------+----------------+--------------------------------------+--------------------------------------+--------------+---------+----------------------------+----------------------------+
| id                                   | environment_id | application_id                       | group_id                             | roles        | version | when_updated               | when_created               |
+--------------------------------------+----------------+--------------------------------------+--------------------------------------+--------------+---------+----------------------------+----------------------------+
| 7c8d400c-2daf-4268-acf8-b59cc1252e0b | NULL           | a0ddda31-2cf8-4c12-9a04-15674d731bd5 | 7a789c02-8964-4ff3-8d6a-27656e5a4aed | FEATURE_EDIT |       1 | 2023-06-07 05:51:58.828000 | 2023-06-07 05:51:58.828000 |
+--------------------------------------+----------------+--------------------------------------+--------------------------------------+--------------+---------+----------------------------+----------------------------+
1 row in set (0.001 sec)

# swap the group id to the right group id (the one in portfolio "First")
MariaDB [featurehub]> update fh_acl set group_id = 'e4fb3ba6-a5f0-46f8-91a5-b317c9a26887' where id = '7c8d400c-2daf-4268-acf8-b59cc1252e0b';
Query OK, 1 row affected (0.002 sec)
Rows matched: 1  Changed: 1  Warnings: 0

# tell the application it has a new portfolio
MariaDB [featurehub]> update fh_application set fk_portfolio_id = 'b5d3d5d9-6830-4176-86b1-a64d9c590c0f' where id = 'a0ddda31-2cf8-4c12-9a04-15674d731bd5';
Query OK, 1 row affected (0.001 sec)
Rows matched: 1  Changed: 1  Warnings: 0
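
The statements above run one at a time, so a mistake halfway through leaves the application half moved. The sketch below shows the same four changes wrapped in a single transaction that rolls back if any statement fails. It is an illustration under stated assumptions, not a FeatureHub tool: it uses Python's built-in sqlite3 with a tiny stand-in schema so that it can run anywhere, the short ids are placeholders for the UUIDs found above, and it assumes the only application-level ACL is the superuser group's, as in this walkthrough. Against a real MariaDB database you would use your own driver, whose placeholder syntax and transaction handling may differ.

```python
import sqlite3


def move_application(conn, app_id, new_portfolio_id, new_admin_group_id):
    """Re-home one application, mirroring the manual steps in one transaction."""
    env_ids = "select id from fh_environment where fk_app_id = ?"
    with conn:  # commits if every statement succeeds, rolls back otherwise
        # remove service account permissions on the application's environments
        conn.execute(
            "delete from fh_service_account_env "
            f"where fk_environment_id in ({env_ids})",
            (app_id,),
        )
        # remove environment-level ACLs; their groups belong to the old portfolio
        conn.execute(
            f"delete from fh_acl where environment_id in ({env_ids})", (app_id,)
        )
        # ASSUMPTION: the only application-level ACL is the superuser group's
        conn.execute(
            "update fh_acl set group_id = ? where application_id = ?",
            (new_admin_group_id, app_id),
        )
        # tell the application it has a new portfolio
        conn.execute(
            "update fh_application set fk_portfolio_id = ? where id = ?",
            (new_portfolio_id, app_id),
        )


def demo():
    """Build a small stand-in schema in memory and run the move against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table fh_application (id text primary key, name text,
                                     fk_portfolio_id text);
        create table fh_environment (id text primary key, fk_app_id text);
        create table fh_acl (id text primary key, environment_id text,
                             application_id text, group_id text, roles text);
        create table fh_service_account_env (id text primary key,
                             fk_environment_id text, fk_service_account_id text);
        insert into fh_application values ('app1', 'app1', 'second');
        insert into fh_environment values ('env1', 'app1');
        insert into fh_acl values
            ('acl-app', null, 'app1', 'second-admins', 'FEATURE_EDIT');
        insert into fh_acl values ('acl-env', 'env1', null, 'second-group', 'READ');
        insert into fh_service_account_env values ('sae1', 'env1', 'sa1');
        """
    )
    move_application(conn, "app1", "first", "first-admins")
    return conn


if __name__ == "__main__":
    conn = demo()
    print(conn.execute("select id, fk_portfolio_id from fh_application").fetchall())
    print(conn.execute("select id, group_id from fh_acl").fetchall())
```

Doing the deletes before the updates keeps the same order as the manual session, and the single transaction means the application is never visible in the new portfolio while it still carries permissions from the old one.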