APIM gateways publish an analytics event to the DAS receiver for each and every request they serve. When DAS receives an event, it persists the raw event after performing some real-time analysis. It also generates new events based on the original ones, which in turn populate additional tables. These data are then processed by the batch analytics scripts, and the summarized data are persisted in both DAS and the RDBMS.
As these data accumulate, DAS storage fills up over time. When the system runs for a long time with the accumulated data still in place, this affects performance, hits storage limits, and forces unnecessary filtering. The issue is not limited to DAS; it applies to the RDBMS as well.
Below are the measures that can be taken to overcome this issue. Some of them are already present in APIM, but there are gaps in the system that need to be addressed. In addition, when extending the analytics system, these steps should be followed to keep a long-running system performant.
Incremental analysis is a feature of WSO2 DAS and APIM Analytics that optimizes performance in a long-running system handling big data. The key idea is that data already processed in a previous run are skipped the next time a Spark job runs. In APIM, only a few scenarios use this feature, because not all event streams receive the same amount of traffic: only the request, response, and execution-time event stream data are processed incrementally.
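As a sketch of how this looks in a DAS Spark script: incremental processing is declared with the `incrementalParams` option when defining the temporary table, and the processed window is committed at the end of the script with `INCREMENTAL_TABLE_COMMIT`. The table and column names below are illustrative, not the exact APIM schema; check your own Spark scripts for the real identifiers.

```sql
-- Declare the source table with incremental processing enabled.
-- "requestData" is a unique ID for this incremental context (illustrative),
-- and DAY is the time window the job processes per run.
CREATE TEMPORARY TABLE APIRequestData USING CarbonAnalytics OPTIONS (
  tableName "ORG_WSO2_APIMGT_STATISTICS_REQUEST",
  incrementalParams "requestData, DAY");

-- Summarization query: only rows not yet processed in a previous run
-- are fed into this INSERT.
INSERT INTO TABLE APIRequestSummaryData
  SELECT api, version, COUNT(*) AS total_request_count
  FROM APIRequestData
  GROUP BY api, version;

-- Mark the processed window as done so the next run skips it.
INCREMENTAL_TABLE_COMMIT requestData;
```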
If you want to improve performance further, it is time to consider the other tables that use Spark summarization (such as the throttle summary) and apply incremental processing to them as well.
Data purging is another key factor that affects performance. If incremental processing is enabled, leaving purging disabled does not hurt the system much at first. Eventually, however, you need to consider purging the analytics tables, since they consume system storage and RDBMS space and degrade RDBMS performance as they grow.
Currently, APIM provides data purging configuration only for a few alert-related tables. The reason is that purging user data is the user's decision, and APIM cannot make it on their behalf. Moreover, you cannot simply purge tables at will, because doing so affects other tables and causes data loss.
In the APIM analytics table list, there are, in my view, three sets of tables. Those are,
Of these sets, the batch-processing tables are the most critical in the system, and their data normally grow faster than the others. You also cannot purge records in these tables directly. As you may know, APIM batch analytics data are summarized on a daily basis, so data less than 24 hours old are not yet completely processed and cannot be removed from the tables. We therefore need to retain those data when applying purging to this table set. In the purging configuration, it is better to set the retention interval for these tables to 2 days, which includes 1 additional day as an error margin to prevent data loss.
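For reference, a purging schedule along these lines can be expressed through the DAS data purging configuration (in `analytics-config.xml` under the Analytics server's configuration directory). The element names follow the DAS purging feature; the table-name pattern is illustrative and should be adjusted to match your batch-processing tables, and the configuration should be verified against your DAS version.

```xml
<analytics-data-purging>
   <!-- Enable the scheduled purging task -->
   <purging-enable>true</purging-enable>
   <!-- Run the purge daily at midnight -->
   <cron-expression>0 0 0 * * ?</cron-expression>
   <!-- Regex patterns selecting which tables to purge (illustrative) -->
   <purge-include-tables>
      <table>ORG_WSO2_APIMGT_STATISTICS_.*</table>
   </purge-include-tables>
   <!-- Keep 2 days of data: 1 day still being summarized + 1 day error margin -->
   <data-retention-days>2</data-retention-days>
</analytics-data-purging>
```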
This table set includes the alert-related data.
This set contains two groups of tables: stream tables and generated-alert tables. Stream tables persist the data of the streams that are generated from the original analytics events. Those data are used to generate alerts, and the generated alerts are saved in the alert tables. Data purging can be applied to these tables without affecting any functionality, although it clears the system's alert history. It is recommended to keep at least 30 days of alert history and to clean the data once it expires.
This table holds the analytics-related configuration, including the alert-subscription information: which alert types each user has subscribed to and the email configuration of each subscriber. Hence this table must not be purged. If it were purged, all the subscribed users and their alert subscriptions would be reset, and since this is the only place those configuration data are stored, there would be no way to roll them back.
The following summary tables are used to store summarized (processed) data in DAS before it is saved to the external RDBMS. These tables are deprecated from APIM 2.1.0 onward, so after APIM 2.1.0 they may not be created or populated.
If these tables exist in your system, you can purge them as well. They are the original copy of the RDBMS summary tables, so once the data have been moved to the RDBMS, purging can be initiated. As with the raw event tables, the condition here is also to purge data older than 2 days, because these tables are updated every time the batch analytics job executes within the day. Even if these summary tables are purged from the system, the RDBMS summary tables are not affected: their data do not disappear, and previously summarized data remain intact.
These tables also hold summarized data, derived from the log analytics events, and are populated only if log analytics is enabled. The data are stored only in the Analyzer tables, and the APIM dashboard fetches them over the Analytics REST API. These tables may be purged under the same conditions as the previous summary tables.
Data archiving can be performed on both the Analytics server databases and the APIM statistics database. The objective is to reduce the number of records in the database, optimizing the system by freeing up disk space and reducing the time taken to execute SQL queries on both the analytics and APIM servers.
Analytics server data can be removed from the system by enabling the data purging configuration of the Analytics server's purging feature. Data archiving is required, however, if the data need to be backed up for further analysis, or if there is a requirement to restore them later.
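On the RDBMS side, archiving typically means copying the old records into an archive table (or a separate archive database) before deleting them from the live table. A minimal sketch in plain SQL, with illustrative table and column names and an illustrative cutoff date; run both statements in one transaction so a failure cannot delete rows that were never copied:

```sql
-- Copy records older than the retention cutoff into the archive table
-- (assumes API_REQUEST_SUMMARY_ARCHIVE has the same columns).
INSERT INTO API_REQUEST_SUMMARY_ARCHIVE
  SELECT * FROM API_REQUEST_SUMMARY
  WHERE request_time < '2017-01-01 00:00:00';

-- Remove the archived records from the live statistics table.
DELETE FROM API_REQUEST_SUMMARY
  WHERE request_time < '2017-01-01 00:00:00';
```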
The statistics data, which are a summary of the original analytics events, can also be archived. Archiving summary data may be required when the system needs to be tuned: while the raw data are expected to be purged, the summary data should remain available for further analysis. Another use case for archiving from the statistics database is that summary data are sometimes processed further over a larger time span (e.g., per-day records rolled up into a per-year summary) and stored in the same database or elsewhere. The processed per-day records can then be cleaned from the system, since a new representation of the same data is present.
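The per-day-to-per-year roll-up mentioned above can be sketched as follows; again the table and column names are illustrative, not the actual APIM statistics schema:

```sql
-- Roll the per-day summary up into a coarser per-year summary table.
INSERT INTO API_REQUEST_SUMMARY_YEARLY (api, version, year, total_request_count)
  SELECT api, version, year, SUM(total_request_count)
  FROM API_REQUEST_SUMMARY
  GROUP BY api, version, year;

-- The per-day rows for completed years are now redundant and can be cleaned.
DELETE FROM API_REQUEST_SUMMARY WHERE year < 2017;
```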
This is therefore an optional task, performed to improve performance or to reclaim system storage. The way to perform it depends on the underlying database.