Monday, 30 December 2024

Database Profiling

 







 SQL Column Profiling


1. Count of NULL values:   

   SELECT COUNT(*) - COUNT(column_name) AS null_count
   FROM table_name;  

2. Distinct values and their counts:   

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name;  

3. Minimum and maximum values:   

   SELECT MIN(column_name) AS min_value, MAX(column_name) AS max_value
   FROM table_name;  

4. Average and standard deviation:   

   SELECT AVG(column_name) AS avg_value, STDDEV(column_name) AS stddev_value
   FROM table_name;  

5. String length distribution:   

   SELECT LENGTH(column_name) AS length, COUNT(*)
   FROM table_name
   GROUP BY LENGTH(column_name);


SQL Cross-Column Profiling


1. Correlation between two numeric columns:   

   SELECT CORR(column1, column2) AS correlation
   FROM table_name;  

2. Finding unique combinations of values across columns:   

   SELECT column1, column2, COUNT(*)
   FROM table_name
   GROUP BY column1, column2; 

3. Detecting functional dependencies:   

   SELECT column1, column2, COUNT(DISTINCT column2) AS unique_values
   FROM table_name
   GROUP BY column1
   HAVING COUNT(DISTINCT column2) = 1; 

4. Checking for null values across multiple columns:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column1 IS NULL OR column2 IS NULL; 


SQL Cross-Table Profiling

1. Foreign Key Relationships:   

   SELECT 
       tc.table_schema, 
       tc.table_name, 
       kcu.column_name, 
       ccu.table_schema AS foreign_table_schema,
       ccu.table_name AS foreign_table_name,
       ccu.column_name AS foreign_column_name 
   FROM 
       information_schema.table_constraints AS tc 
       JOIN information_schema.key_column_usage AS kcu
         ON tc.constraint_name = kcu.constraint_name
         AND tc.table_schema = kcu.table_schema
       JOIN information_schema.constraint_column_usage AS ccu
         ON ccu.constraint_name = tc.constraint_name
         AND ccu.table_schema = tc.table_schema
   WHERE tc.constraint_type = 'FOREIGN KEY';   

2. Join Analysis:   

   SELECT 
       t1.column1, t2.column2, COUNT(*)
   FROM 
       table1 t1
       JOIN table2 t2 ON t1.common_column = t2.common_column
   GROUP BY 
       t1.column1, t2.column2;   

3. Referential Integrity Checks:   

   SELECT 
       t1.common_column
   FROM 
       table1 t1
   LEFT JOIN 
       table2 t2 ON t1.common_column = t2.common_column
   WHERE 
       t2.common_column IS NULL;


SQL Data Rule Validation Profiling


1. Check for non-null values:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column_name IS NULL;  

2. Check for unique values:  

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1;  

3. Check for valid ranges:   

   SELECT *
   FROM table_name
   WHERE column_name < min_value OR column_name > max_value;   

4. Check for specific patterns (e.g., email format):   

   SELECT *
   FROM table_name
   WHERE column_name NOT LIKE '%_@__%.__%';   

5. Check for foreign key constraints:   

   SELECT t1.*
   FROM table1 t1
   LEFT JOIN table2 t2 ON t1.foreign_key = t2.primary_key
   WHERE t2.primary_key IS NULL;

SQL Cardinality

1. Count Distinct Values:  

   SELECT COUNT(DISTINCT column_name) AS cardinality
   FROM table_name;   

2. High Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   ORDER BY frequency DESC;  

3. Low Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1
   ORDER BY frequency DESC;

References


 

Friday, 27 December 2024

Data Governance

 

Data governance is a set of processes, policies, and standards that ensure data is secure, accurate, and usable.


Data Governance Framework

The DAMA-DMBOK (Data Management Body of Knowledge) is a comprehensive framework developed by DAMA International to guide organizations in managing data as a strategic asset. It provides best practices and structured approaches across various aspects of data management, ensuring data quality, accessibility, and compliance



Case Study- Hospitality Domain






Data Inventory and Classification




Data Quality Management





Data Security and Privacy





Data Integration and Interoperability






Data Access and Usage






Reference:


Tuesday, 24 December 2024

Domain-Driven Design for Hotel Management System

Core Domain

The core domain represents the most critical and unique aspects of the hotel management system that provide competitive advantage.


Booking Management

  • Bounded Context: Handles room availability, reservations, and cancellations.
  • Ubiquitous Language: Booking, Reservation, Availability, Cancellation, Check-in, Check-out.


Guest Management

  • Bounded Context: Manages guest profiles, preferences, and loyalty programs.
  • Ubiquitous Language: Guest Profile, Loyalty Points, Preferences, Membership, Rewards.


Payment Processing

  • Bounded Context: Manages payment transactions, billing, and refunds.
  • Ubiquitous Language: Payment, Billing, Invoice, Refund, Transaction, Payment Gateway.


Supporting Domain

The supporting domain includes functionalities that are important but not unique to the hotel management system.

Customer Support

  • Bounded Context: Handles guest inquiries, complaints, and support tickets.
  • Ubiquitous Language: Support Ticket, Inquiry, Complaint, Resolution, Live Chat, Help Desk.


Housekeeping Management

  • Bounded Context: Manages housekeeping schedules, tasks, and inventory.
  • Ubiquitous Language: Housekeeping Schedule, Task, Inventory, Cleaning, Maintenance.


Event Management

  • Bounded Context: Manages event bookings, scheduling, and coordination.
  • Ubiquitous Language: Event Booking, Schedule, Coordination, Venue, Catering.



Generic Domain

The generic domain includes functionalities that are common across many systems and can be outsourced or reused.


Authentication and Authorization

  • Bounded Context: Manages user authentication, roles, and permissions.
  • Ubiquitous Language: User, Role, Permission, Authentication, Authorization, Login, Access Control.


Reporting and Analytics

  • Bounded Context: Generates reports and provides analytics on system usage and performance.
  • Ubiquitous Language: Report, Analytics, Dashboard, Metrics, KPI, Data Visualization.


Notification Service

  • Bounded Context: Manages sending notifications via email, SMS, and push notifications.
  • Ubiquitous Language: Notification, Email, SMS, Push Notification, Alert, Message.

Documentation for Domain-Driven Design for Hotel Management System

Sunday, 15 December 2024

Architecture Viewpoints & Artifacts

Terminology

System: A collection of components organized to accomplish specific functions.

Architecture: The fundamental organization of a system, including components, their relationships, and guiding principles.

Architecture Description: A collection of artifacts documenting an architecture. 

Stakeholders: Individuals or groups with key roles or concerns about the system, such as users, developers, or managers.

Concerns: Crucial interests of stakeholders that determine the system's acceptability, covering aspects like performance, reliability, security, and evolvability.

View: Represents the system from the perspective of related concerns.

Architecture Models: Created by architects to capture the system's design. A view comprises selected parts of one or more models to address stakeholder concerns.

Viewpoint: Defines the perspective from which a view is taken, including how to construct and use the view, the information to include, modeling techniques, and rationale.

Viewpoints: Generic and reusable, while views are specific to the architecture.

Architecture Views: Representations of the overall architecture in terms meaningful to stakeholders, enabling communication and verification that the system addresses their concerns.

Concerns vs. Requirements: Concerns are areas of interest, while requirements are specific needs derived from concerns. Requirements should be SMART (Specific, Measurable, Achievable, Relevant, Time-bound).

Reference diagram:

Note: Copied from opengroup.org





    

Case Study : Hotel Management System


Stakeholders & Concerns


View and Viewpoint


The TOGAF architecture domains are themselves viewpoints. TOGAF Architecture Domains Applied to Hotel Management System

Business Architecture Domain

  • Purpose: Addresses the needs of users, planners, and business management.
  • Stakeholders: Guests, hotel staff, and management.
  • Artifacts:
    • Business Process Models: Diagrams showing the booking process from search to confirmation.
    • Use Case Diagrams: Illustrating interactions between guests, hotel staff, and the system.
    • Business Capability Maps: Highlighting the capabilities required to support the booking process.

Data Architecture Domain

  • Purpose: Addresses the needs of database designers, database administrators, and system engineers.
  • Stakeholders: Database designers, administrators, and IT team.
  • Artifacts:
    • Data Models: Entity-relationship diagrams representing guests, bookings, payments, and room details.
    • Data Flow Diagrams: Illustrating how data moves between components like the booking engine, customer database, and payment gateway.
    • Data Catalogs: Lists of data entities, attributes, and relationships.

Application Architecture Domain

  • Purpose: Addresses the needs of system and software engineers.
  • Stakeholders: System and software engineers, developers.
  • Artifacts:
    • Application Models: Diagrams showing the structure and interactions of software components.
    • Application Interaction Matrices: Mapping interactions between different applications and services.
    • Application Catalogs: Lists of applications, their functionalities, and interfaces.

Technology Architecture Domain

  • Purpose: Addresses the needs of acquirers, operators, administrators, and managers.
  • Stakeholders: IT operators, system administrators, and managers.
  • Artifacts:
    • Technology Models: Diagrams showing the hardware and network infrastructure.
    • Technology Standards Catalogs: Lists of technology standards and guidelines.
    • Technology Roadmaps: Plans for technology upgrades and integration.


Note The TOGAF® Standard is a leading Enterprise Architecture framework that enhances business efficiency through consistent standards and methods.

Artifact can be found here  artifact 

Thursday, 5 December 2024

JVM Optimization


Understand Your Application's Behavior


Profile Your Application

Use profiling tools to understand memory usage patterns and identify memory leaks. VisualVM can be used for the same


        

https://github.com/manaspratimdas/memory-analysis/tree/master/myappsmem/src/main/java/myappsmem/heapexhaustion/ml

 

Monitor GC Logs

Regularly analyze GC logs to understand the frequency and duration of GC. While running the application we can enable GC logging with below JVM argument as follows 

-Xlog:gc*:file=gclog/gc_%t_%p.log


Choose the Right Garbage Collector

  •    Serial GC: -XX:+UseSerialGC: Suitable for small applications with low memory requirements.
  •    Parallel GC: -XX:+UseParallelGC : Good for applications with high throughput requirements.
  •    CMS GC: -XX:+UseConcMarkSweepGC : Suitable for applications requiring low pause times.
  •    G1 GC: -XX:+UseG1GC : A balanced option for applications with large heaps and requiring predictable pause

      Illustration: Serial GC vs G1 GC

https://github.com/manaspratimdas/memory-analysis/blob/master/myappsmem/src/main/java/myappsmem/optimization/MyAppMemAnalyzer.java

Run the program with JVM arguments 

  • -Xlog:gc*:file=gclog/gc_%t_%p.log -XX:+UseSerialGC
  • -Xlog:gc*:file=gclog/gc_%t_%p.log -XX:+UseG1GC



Tune Heap Size


Set Initial and Maximum Heap Size

Use `-Xms` and `-Xmx` to set the initial and maximum heap size. It's often recommended to set them to the same value to avoid resizing during runtime

  • -Xlog:gc*:file=gclog/gc_%t_%p.log (Default) [ it took xms as 254 and xmx as 4048]
  • -Xms16m -Xmx512m -Xlog:gc*:file=gclog/gc_%t_%p.log
  • -Xms64m -Xmx64m -Xlog:gc*:file=gclog/gc_%t_%p.log







Adjust Young Generation Size

Use `-XX:NewSize` and `-XX:MaxNewSize` to tune the size of the young generation. A larger young generation can reduce the frequency of minor GC

  • -XX:NewSize=128m -XX:MaxNewSize=128m: No GCs occurred, indicating that the memory allocation was sufficient to avoid GC events.
  • -XX:NewSize=64m -XX:MaxNewSize=64m: One GC event occurred, suggesting that the memory allocation was almost sufficient but required one cleanup.
  • -XX:NewSize=16m -XX:MaxNewSize=16m: Seven GC events occurred, indicating that the memory allocation was insufficient, leading to frequent GCs.

When the number of garbage collections (GC) increases, it can significantly impact the performance of your application
  • GC Pause Time: Frequent GCs cause more pauses, degrading application responsiveness and throughput.
  • CPU Usage: Higher GC frequency increases CPU usage, as more time is spent on memory management.
  • Latency: More frequent GCs lead to higher latency, affecting real-time performance.
  • Memory Fragmentation: Frequent GCs can cause memory fragmentation, slowing down memory








Monday, 2 December 2024

JVM  Profiling with Eclipse MAT

JVM Profiling refers to the process of analyzing the performance and behavior of applications running on the Java Virtual Machine (JVM).






Heap Memory Issue




Eclipse Memory Analyzer Tool (MAT)

powerful Java heap analyzer that helps you identify memory leaks and optimize memory usage in Java applications.

  • Heap Dump Analysis
  • Leak Suspects Report
  • Retained Sizes Calculation
  • Memory Consumption Patterns

Shallow Heap: The shallow heap of an object is the amount of memory that is directly allocated for that object. 

Retained Heap: The retained heap of an object is the amount of memory that will be freed when the object is garbage collected. It includes the shallow heap of the object and the shallow heap of all objects that are reachable only from that object. 




Dominator Tree

A dominator tree is a representation of the object graph where each node (object) is dominated by its parent. An object X is said to dominate an object Y if every path from the root to Y must pass through X1.
Purpose: The dominator tree helps you identify the largest chunks of retained memory and understand the keep-alive dependencies among objects.


Path to GC





Eclipse MAT configuration




How to create Heap Dump


Configure below VM argument while running the java application

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:/Users/xxxx/myapps/perf/heapdump -Xmx512m

Important Reports in MAT


Leak Report



Histogram Report




Dominator Report




Details on analysis can be found below

Heap Exhaustion – Memory Leak





Streaming with Kafka API

The Kafka Streams API is a Java library for building real-time applications and microservices that efficiently process and analyze large-sca...