Wednesday, 27 December 2017

Data quality metrics for Performance Measurement

Introduction


Data quality can be measured through metrics, which in turn help to identify issues and enable the performance engineer to create or modify data so that it adheres to the required quality. Data quality depends on the type of application, the type of tables/views used in the application, and so on. If the data quality metrics are not adhered to, the performance measurement gets compromised.

SAP applications are used by many companies. With the availability of the SAP HANA platform, the business applications developed at SAP have undergone a paradigm shift. Complex operations are pushed down to the database. The rule of thumb, therefore, is to get the best performance by doing as much as you can in the database. Applications developed on SAP HANA use a new data modeling infrastructure known as Core Data Services (CDS). With CDS views, data models are defined and consumed on the database rather than on the application server. The application developer can now use various built-in functions, extensions, etc.

Performance Test Process

The Fiori applications that are developed try to make the most of the SAP HANA platform. In S/4HANA applications, whenever the Fiori application makes a request to retrieve information, it hits the CDS views. The SQL query, with the CDS view name in the FROM clause along with the filters, is passed on to the HANA database. The query gets executed in the HANA database and the result set is returned to the Fiori UI.

To measure the performance of the Fiori application for a single user, the performance test usually starts with a few dry runs, after which the performance of the application is measured. The measured performance is then compared with the defined thresholds and violations are identified.

Data Quality


For measuring the performance of the application, data quality plays a crucial role. The test system in which the performance measurement needs to be taken should have data of adequate quality. There are CDS views that are used more frequently and have a higher volume of transactions than others. So, there is a need to distinguish the CDS views that have high volume and are used more frequently. Such views need to adhere to the performance standards, and a lag in response is not acceptable. A lag in response may occur due to the following factors:

1. Filters are not pushed down correctly
2. Join conditions or cyclic joins degrade the performance
3. Redundant unions or joins exist in the CDS views
4. Currency conversion is not modeled appropriately
5. Execution of the CDS view generates many temporary tables, for example due to materializing fields with aggregation on a large set of rows

While designing CDS views, the above factors need to be considered. Apart from these factors, there are also cases where the system has very little data (fewer than 1,000 rows) and performance issues therefore go undetected. An inherent problem within a CDS view only becomes visible or detectable when the system has at least the bare minimum amount of data implied by annotations such as Service Quality and Size.

Data Quality Metrics:


CDS views are often categorized into the size categories ‘S’, ‘M’, ‘L’, ‘XL’ and ‘XXL’. In the CDS view, the ObjectModel.usageType.sizeCategory annotation is used to define the size based on the volume of data the view can expect.
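As an illustration, the annotation is placed in the header of the CDS DDL source. The view name and SQL view name below are hypothetical; AENR is the change master table referred to later in this post:

```abap
@AbapCatalog.sqlViewName: 'ZDEMOCHGMST'   " hypothetical SQL view name
@ObjectModel.usageType: {
    serviceQuality: #D,                   " service quality classification
    sizeCategory:   #L,                   " view expects up to 10^7 rows
    dataClass:      #TRANSACTIONAL
}
define view Z_Demo_ChangeMaster           " hypothetical view name
  as select from aenr                     " AENR: change master header table
{
  key aennr as ChangeNumber
}
```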

The resource consumption on HANA is mainly driven by two factors:

◈ The set of data which has to be searched through, and
◈ The set of data which has to be materialized in order to compute the result set.

These metrics help to identify whether a sufficient number of rows exist in the HANA database. They are just an indicator of whether the performance measurement for a single-user test can be performed or not. If these bare minimum criteria are not met, then one won’t be able to unearth the defects that may creep in over time as the data grows.

Size category S corresponds to fewer than 1,000 rows. Similarly, size category M corresponds to fewer than 10^5 rows, and size category L to fewer than 10^7 rows.
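The category bounds above can be sketched as a simple lookup. The assumption here (mine, not stated by the annotation itself) is that a sensible minimum test-data volume for a category is the upper bound of the next smaller category:

```abap
" Illustrative only: minimum row count one might require in the test
" system for a given sizeCategory value before measuring performance.
DATA(size_category) = 'L'.
DATA(min_rows) = COND i( WHEN size_category = 'S' THEN 0
                         WHEN size_category = 'M' THEN 1000      " upper bound of S
                         WHEN size_category = 'L' THEN 100000    " upper bound of M (10^5)
                         ELSE 10000000 ).                        " upper bound of L (10^7)
```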

SQL statement to retrieve the table information:


A report program can be written that takes a CDS view name as user input. With the statement below, it is first determined whether the given input is a CDS view or not.

SELECT OBJECTNAME FROM DDLDEPENDENCY WHERE STATE = 'A' AND OBJECTTYPE = 'STOB' AND DDLNAME = '<CDS_NAME>' INTO TABLE @DATA(ENTITYTAB).

Here <CDS_NAME> is to be replaced with the CDS view name.

For example:

SELECT OBJECTNAME FROM DDLDEPENDENCY WHERE STATE = 'A' AND OBJECTTYPE = 'STOB' AND DDLNAME = 'A_CHANGEMASTER' INTO TABLE @DATA(ENTITYTAB).

“API_CHANGEMASTER” is a whitelisted service listed in the SAP API Hub. When this OData service is invoked by any client-side application (Fiori or a custom application), it internally hits the “A_CHANGEMASTER” CDS view.

When we execute the above query, we are able to retrieve the object name. In this case, it is a valid and activated CDS view. Once we have the object name, we can get the number of tables used by the CDS view.

When we want to retrieve all the CDS views that start with 'A_C%', it can be done as follows:

SELECT DISTINCT SRC~DDLNAME, TADIR~DEVCLASS AS PACKAGE, TADIR~AUTHOR AS AUTHOR
  FROM TADIR INNER JOIN DDDDLSRC AS SRC ON TADIR~OBJ_NAME = SRC~DDLNAME
  WHERE TADIR~PGMID = 'R3TR' AND TADIR~OBJECT = 'DDLS'
    AND SRC~AS4LOCAL = 'A'
    AND SRC~DDLNAME LIKE 'A_C%'
  INTO TABLE @DATA(DDLS).

IF LINES( DDLS ) > 0.
  SELECT OBJECTNAME FROM DDLDEPENDENCY FOR ALL ENTRIES IN @DDLS
    WHERE STATE = 'A' AND OBJECTTYPE = 'STOB' AND DDLNAME = @DDLS-DDLNAME
    INTO TABLE @DATA(ENTITYTAB).
ENDIF.

Now loop through the tables to find the number of rows present in the database. For the CDS view size category, this is the starting point for knowing the quality of the data. To be more stringent, based on the type of application, we can also check the distinct entries present in the table. This helps to identify whether enough data is present in the table. If there are too few entries, the quality engineer must create the data before taking the performance measurement.
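The distinct-entry check mentioned above could be sketched as follows. The field name AENNR (the change number, a key field of AENR) is used here as a plausible example; for other tables, a suitable key field would have to be chosen:

```abap
" Illustrative only: count distinct values of a key field to judge
" data variety, not just the raw row count.
DATA distinct_count TYPE i.
DATA(tablename) = 'AENR'.                       " table used by the CDS view
DATA(sel_list)  = 'COUNT( DISTINCT AENNR )'.    " dynamic select list
SELECT (sel_list) FROM (tablename) INTO @distinct_count.
```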

IF SY-SUBRC = 0.
  CREATE OBJECT LR_VISITOR TYPE CL_DD_DDL_META_NUM_COLLECTOR
    EXPORTING DESCEND = ABAP_TRUE.
  DATA L_NAME TYPE STRING.
  DATA TABLENAME TYPE STRING.
  DATA COUNT TYPE I.

  LOOP AT ENTITYTAB INTO DATA(LV_ENAME).
    L_NAME = LV_ENAME-OBJECTNAME.
    LR_VISITOR->VISITDDLSOURCE( IV_DSNAME = L_NAME ).
    DATA(LR_NUMBERMAP) = LR_VISITOR->GETNUMBERMAP( ).
    READ TABLE LR_NUMBERMAP ASSIGNING FIELD-SYMBOL(<CDS_VIEW>)
      WITH KEY ENTITY = LV_ENAME-OBJECTNAME.
    IF SY-SUBRC NE 0.
      CONTINUE.
    ENDIF.

    " LR_TABS is assumed to be declared with the line type of TABLE_TAB
    LOOP AT <CDS_VIEW>-NUMBERS-TAB_INFO-TABLE_TAB ASSIGNING FIELD-SYMBOL(<TAB_INFO>).
      COLLECT <TAB_INFO> INTO LR_TABS.
      TABLENAME = <TAB_INFO>-TABNAME.
      SELECT COUNT(*) FROM (TABLENAME) INTO COUNT.
    ENDLOOP.
  ENDLOOP.
ENDIF.

Here we get the number of entries present in the table. For the above example, there is only one table, AENR. When we query the number of entries in table AENR using COUNT(*), the output is 100 rows. This is too little, as the size category annotation is specified as ‘L’. So, the bare minimum number of rows required before starting the performance measurement lies somewhere above 10^5 and up to 10^7.