Monday, 30 December 2024

Database Profiling

 







 SQL Column Profiling


1. Count of NULL values:   

   SELECT COUNT(*) - COUNT(column_name) AS null_count
   FROM table_name;  

2. Distinct values and their counts:   

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name;  

3. Minimum and maximum values:   

   SELECT MIN(column_name) AS min_value, MAX(column_name) AS max_value
   FROM table_name;  

4. Average and standard deviation:   

   SELECT AVG(column_name) AS avg_value, STDDEV(column_name) AS stddev_value
   FROM table_name;  

5. String length distribution:   

   SELECT LENGTH(column_name) AS length, COUNT(*)
   FROM table_name
   GROUP BY LENGTH(column_name);


SQL Cross-Column Profiling


1. Correlation between two numeric columns:   

   SELECT CORR(column1, column2) AS correlation
   FROM table_name;  

2. Finding unique combinations of values across columns:   

   SELECT column1, column2, COUNT(*)
   FROM table_name
   GROUP BY column1, column2; 

3. Detecting functional dependencies:   

   SELECT column1, COUNT(DISTINCT column2) AS unique_values
   FROM table_name
   GROUP BY column1
   HAVING COUNT(DISTINCT column2) = 1; 

4. Checking for null values across multiple columns:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column1 IS NULL OR column2 IS NULL; 
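
The functional dependency check in item 3 can also be run over rows already loaded into memory. The following is a minimal Java sketch, not a definitive implementation: the class name, the two-element row layout, and the sample airport/country values are assumptions made for illustration. It reports whether every column1 value maps to exactly one column2 value across the whole data set.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper: checks whether column1 -> column2 holds over in-memory rows.
class FunctionalDependencyCheck {

    // Each row is modeled as a two-element array: {column1, column2}.
    static boolean holds(List<String[]> rows) {
        Map<String, String> firstSeen = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0];
            String value = row[1];
            if (firstSeen.containsKey(key) && !Objects.equals(firstSeen.get(key), value)) {
                return false; // same column1 with a different column2: dependency violated
            }
            firstSeen.put(key, value);
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample data is made up for illustration.
        List<String[]> rows = List.of(
                new String[] {"DEL", "India"},
                new String[] {"BOM", "India"},
                new String[] {"DEL", "India"});
        System.out.println("column1 -> column2 holds: " + holds(rows));
    }
}
```

Unlike the SQL query, which lists the column1 values that satisfy the dependency, this sketch returns a single yes/no answer for the whole data set.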


SQL Cross-Table Profiling

1. Foreign Key Relationships:   

   SELECT 
       tc.table_schema, 
       tc.table_name, 
       kcu.column_name, 
       ccu.table_schema AS foreign_table_schema,
       ccu.table_name AS foreign_table_name,
       ccu.column_name AS foreign_column_name 
   FROM 
       information_schema.table_constraints AS tc 
       JOIN information_schema.key_column_usage AS kcu
         ON tc.constraint_name = kcu.constraint_name
         AND tc.table_schema = kcu.table_schema
       JOIN information_schema.constraint_column_usage AS ccu
         ON ccu.constraint_name = tc.constraint_name
         AND ccu.table_schema = tc.table_schema
   WHERE tc.constraint_type = 'FOREIGN KEY';   

2. Join Analysis:   

   SELECT 
       t1.column1, t2.column2, COUNT(*)
   FROM 
       table1 t1
       JOIN table2 t2 ON t1.common_column = t2.common_column
   GROUP BY 
       t1.column1, t2.column2;   

3. Referential Integrity Checks:   

   SELECT 
       t1.common_column
   FROM 
       table1 t1
   LEFT JOIN 
       table2 t2 ON t1.common_column = t2.common_column
   WHERE 
       t2.common_column IS NULL;


SQL Data Rule Validation Profiling


1. Check for non-null values:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column_name IS NULL;  

2. Check for unique values:  

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1;  

3. Check for valid ranges:   

   SELECT *
   FROM table_name
   WHERE column_name < min_value OR column_name > max_value;   

4. Check for specific patterns (e.g., email format):   

   SELECT *
   FROM table_name
   WHERE column_name NOT LIKE '%_@__%.__%';   

5. Check for foreign key constraints:   

   SELECT t1.*
   FROM table1 t1
   LEFT JOIN table2 t2 ON t1.foreign_key = t2.primary_key
   WHERE t2.primary_key IS NULL;
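
The pattern rule in item 4 can be mirrored in application code. This is a sketch under stated assumptions: the class and method names are hypothetical, and the regular expression is a hand translation of the LIKE pattern '%_@__%.__%' (at least one character, '@', at least two characters, '.', at least two characters). It is a shape check only, not a full email validator.

```java
import java.util.regex.Pattern;

// Hypothetical rule check mirroring: column_name NOT LIKE '%_@__%.__%'
class EmailRuleCheck {

    // Hand-translated regex counterpart of the LIKE pattern above.
    private static final Pattern SHAPE = Pattern.compile(".+@.{2,}\\..{2,}", Pattern.DOTALL);

    // Returns true when the value would be reported by the SQL query.
    // NULL is not reported, because NULL NOT LIKE ... is not true in SQL.
    static boolean violatesRule(String value) {
        return value != null && !SHAPE.matcher(value).matches();
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"guest@hotel.com", "guest-at-hotel.com", "a@b.de"}) {
            System.out.println(candidate + " violates rule: " + violatesRule(candidate));
        }
    }
}
```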

SQL Cardinality

1. Count Distinct Values:  

   SELECT COUNT(DISTINCT column_name) AS cardinality
   FROM table_name;   

2. High Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   ORDER BY frequency DESC;  

3. Low Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1
   ORDER BY frequency DESC;

References


 

Friday, 27 December 2024

Data Governance

 

Data governance is a set of processes, policies, and standards that ensure data is secure, accurate, and usable.


Data Governance Framework

The DAMA-DMBOK (Data Management Body of Knowledge) is a comprehensive framework developed by DAMA International to guide organizations in managing data as a strategic asset. It provides best practices and structured approaches across various aspects of data management, ensuring data quality, accessibility, and compliance.



Case Study- Hospitality Domain






Data Inventory and Classification




Data Quality Management





Data Security and Privacy





Data Integration and Interoperability






Data Access and Usage






Reference:


Tuesday, 24 December 2024

Domain-Driven Design for Hotel Management System

Core Domain

The core domain represents the most critical and unique aspects of the hotel management system that provide competitive advantage.


Booking Management

  • Bounded Context: Handles room availability, reservations, and cancellations.
  • Ubiquitous Language: Booking, Reservation, Availability, Cancellation, Check-in, Check-out.


Guest Management

  • Bounded Context: Manages guest profiles, preferences, and loyalty programs.
  • Ubiquitous Language: Guest Profile, Loyalty Points, Preferences, Membership, Rewards.


Payment Processing

  • Bounded Context: Manages payment transactions, billing, and refunds.
  • Ubiquitous Language: Payment, Billing, Invoice, Refund, Transaction, Payment Gateway.


Supporting Domain

The supporting domain includes functionalities that are important but not unique to the hotel management system.

Customer Support

  • Bounded Context: Handles guest inquiries, complaints, and support tickets.
  • Ubiquitous Language: Support Ticket, Inquiry, Complaint, Resolution, Live Chat, Help Desk.


Housekeeping Management

  • Bounded Context: Manages housekeeping schedules, tasks, and inventory.
  • Ubiquitous Language: Housekeeping Schedule, Task, Inventory, Cleaning, Maintenance.


Event Management

  • Bounded Context: Manages event bookings, scheduling, and coordination.
  • Ubiquitous Language: Event Booking, Schedule, Coordination, Venue, Catering.



Generic Domain

The generic domain includes functionalities that are common across many systems and can be outsourced or reused.


Authentication and Authorization

  • Bounded Context: Manages user authentication, roles, and permissions.
  • Ubiquitous Language: User, Role, Permission, Authentication, Authorization, Login, Access Control.


Reporting and Analytics

  • Bounded Context: Generates reports and provides analytics on system usage and performance.
  • Ubiquitous Language: Report, Analytics, Dashboard, Metrics, KPI, Data Visualization.


Notification Service

  • Bounded Context: Manages sending notifications via email, SMS, and push notifications.
  • Ubiquitous Language: Notification, Email, SMS, Push Notification, Alert, Message.

Documentation for Domain-Driven Design for Hotel Management System

Sunday, 15 December 2024

Architecture Viewpoints & Artifacts

Terminology

System: A collection of components organized to accomplish specific functions.

Architecture: The fundamental organization of a system, including components, their relationships, and guiding principles.

Architecture Description: A collection of artifacts documenting an architecture. 

Stakeholders: Individuals or groups with key roles or concerns about the system, such as users, developers, or managers.

Concerns: Crucial interests of stakeholders that determine the system's acceptability, covering aspects like performance, reliability, security, and evolvability.

View: Represents the system from the perspective of related concerns.

Architecture Models: Created by architects to capture the system's design. A view comprises selected parts of one or more models to address stakeholder concerns.

Viewpoint: Defines the perspective from which a view is taken, including how to construct and use the view, the information to include, modeling techniques, and rationale.

Viewpoints vs. Views: Viewpoints are generic and reusable, while views are specific to the architecture.

Architecture Views: Representations of the overall architecture in terms meaningful to stakeholders, enabling communication and verification that the system addresses their concerns.

Concerns vs. Requirements: Concerns are areas of interest, while requirements are specific needs derived from concerns. Requirements should be SMART (Specific, Measurable, Achievable, Relevant, Time-bound).

Reference diagram:

Note: Copied from opengroup.org





    

Case Study : Hotel Management System


Stakeholders & Concerns


View and Viewpoint


The TOGAF architecture domains are themselves viewpoints.

TOGAF Architecture Domains Applied to Hotel Management System

Business Architecture Domain

  • Purpose: Addresses the needs of users, planners, and business management.
  • Stakeholders: Guests, hotel staff, and management.
  • Artifacts:
    • Business Process Models: Diagrams showing the booking process from search to confirmation.
    • Use Case Diagrams: Illustrating interactions between guests, hotel staff, and the system.
    • Business Capability Maps: Highlighting the capabilities required to support the booking process.

Data Architecture Domain

  • Purpose: Addresses the needs of database designers, database administrators, and system engineers.
  • Stakeholders: Database designers, administrators, and IT team.
  • Artifacts:
    • Data Models: Entity-relationship diagrams representing guests, bookings, payments, and room details.
    • Data Flow Diagrams: Illustrating how data moves between components like the booking engine, customer database, and payment gateway.
    • Data Catalogs: Lists of data entities, attributes, and relationships.

Application Architecture Domain

  • Purpose: Addresses the needs of system and software engineers.
  • Stakeholders: System and software engineers, developers.
  • Artifacts:
    • Application Models: Diagrams showing the structure and interactions of software components.
    • Application Interaction Matrices: Mapping interactions between different applications and services.
    • Application Catalogs: Lists of applications, their functionalities, and interfaces.

Technology Architecture Domain

  • Purpose: Addresses the needs of acquirers, operators, administrators, and managers.
  • Stakeholders: IT operators, system administrators, and managers.
  • Artifacts:
    • Technology Models: Diagrams showing the hardware and network infrastructure.
    • Technology Standards Catalogs: Lists of technology standards and guidelines.
    • Technology Roadmaps: Plans for technology upgrades and integration.


Note: The TOGAF® Standard is a leading Enterprise Architecture framework that enhances business efficiency through consistent standards and methods.

Artifact can be found here  artifact 

Thursday, 5 December 2024

JVM Optimization


Understand Your Application's Behavior


Profile Your Application

Use profiling tools to understand memory usage patterns and identify memory leaks. VisualVM is one tool that can be used for this.


        

https://github.com/manaspratimdas/memory-analysis/tree/master/myappsmem/src/main/java/myappsmem/heapexhaustion/ml

 

Monitor GC Logs

Regularly analyze GC logs to understand the frequency and duration of garbage collections. GC logging can be enabled by running the application with the following JVM argument:

-Xlog:gc*:file=gclog/gc_%t_%p.log


Choose the Right Garbage Collector

  • Serial GC (-XX:+UseSerialGC): Suitable for small applications with low memory requirements.
  • Parallel GC (-XX:+UseParallelGC): Good for applications with high throughput requirements.
  • CMS GC (-XX:+UseConcMarkSweepGC): Suitable for applications requiring low pause times. Note that CMS was deprecated in JDK 9 and removed in JDK 14, so this flag only works on older JVMs.
  • G1 GC (-XX:+UseG1GC): A balanced option for applications with large heaps that require predictable pause times.

      Illustration: Serial GC vs G1 GC

https://github.com/manaspratimdas/memory-analysis/blob/master/myappsmem/src/main/java/myappsmem/optimization/MyAppMemAnalyzer.java

Run the program with JVM arguments 

  • -Xlog:gc*:file=gclog/gc_%t_%p.log -XX:+UseSerialGC
  • -Xlog:gc*:file=gclog/gc_%t_%p.log -XX:+UseG1GC



Tune Heap Size


Set Initial and Maximum Heap Size

Use `-Xms` and `-Xmx` to set the initial and maximum heap size. It is often recommended to set them to the same value to avoid heap resizing at runtime.

  • -Xlog:gc*:file=gclog/gc_%t_%p.log (defaults; in this run the JVM chose an initial heap of 254 and a maximum of 4048, presumably in MB)
  • -Xms16m -Xmx512m -Xlog:gc*:file=gclog/gc_%t_%p.log
  • -Xms64m -Xmx64m -Xlog:gc*:file=gclog/gc_%t_%p.log
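
To confirm which heap limits a given set of flags actually produced, the running JVM can be asked directly. The sketch below uses only the standard `Runtime` API; the class name and the `toMb` helper are hypothetical, and the printed values depend on the machine and the flags used.

```java
// Prints the heap limits the running JVM actually settled on.
class HeapSettings {

    // Converts a byte count to whole megabytes.
    static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // totalMemory(): heap currently reserved by the JVM; starts near -Xms.
        System.out.println("current heap (MB): " + toMb(rt.totalMemory()));
        // maxMemory(): upper bound the heap may grow to; close to -Xmx.
        System.out.println("max heap (MB): " + toMb(rt.maxMemory()));
        // freeMemory(): unused space inside the currently reserved heap.
        System.out.println("free in current heap (MB): " + toMb(rt.freeMemory()));
    }
}
```

Running it once with `-Xms16m -Xmx512m` and once with `-Xms64m -Xmx64m` should show the current and maximum values diverging in the first case and staying close together in the second.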







Adjust Young Generation Size

Use `-XX:NewSize` and `-XX:MaxNewSize` to tune the size of the young generation. A larger young generation can reduce the frequency of minor GCs.

  • -XX:NewSize=128m -XX:MaxNewSize=128m: No GCs occurred, indicating that the memory allocation was sufficient to avoid GC events.
  • -XX:NewSize=64m -XX:MaxNewSize=64m: One GC event occurred, suggesting that the memory allocation was almost sufficient but required one cleanup.
  • -XX:NewSize=16m -XX:MaxNewSize=16m: Seven GC events occurred, indicating that the memory allocation was insufficient, leading to frequent GCs.

As the number of garbage collections (GCs) increases, application performance can degrade significantly:
  • GC Pause Time: Frequent GCs cause more pauses, degrading application responsiveness and throughput.
  • CPU Usage: Higher GC frequency increases CPU usage, as more time is spent on memory management.
  • Latency: More frequent GCs lead to higher latency, affecting real-time performance.
  • Memory Fragmentation: Frequent GCs can cause memory fragmentation, slowing down memory allocation.
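
GC counts like the ones compared above can also be read from inside the application, without parsing the log file. This is a minimal sketch using the standard `java.lang.management` API; the class name and the allocation loop are assumptions for illustration, and the actual counts depend on the collector and the heap flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reads cumulative GC statistics from the running JVM.
class GcStats {

    // Sum of collection counts over all collectors (young and old generation).
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) total += count;
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) total += millis;
        }
        return total;
    }

    public static void main(String[] args) {
        // Allocate short-lived objects; whether this triggers a GC depends on the heap settings.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] chunk = new byte[1024];
            checksum += chunk.length;
        }
        System.out.println("allocated bytes: " + checksum);
        System.out.println("collections so far: " + totalCollections());
        System.out.println("time spent in GC (ms): " + totalCollectionMillis());
    }
}
```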








Monday, 2 December 2024

JVM Profiling with Eclipse MAT

JVM Profiling refers to the process of analyzing the performance and behavior of applications running on the Java Virtual Machine (JVM).






Heap Memory Issue




Eclipse Memory Analyzer Tool (MAT)

Eclipse MAT is a powerful Java heap analyzer that helps you identify memory leaks and optimize memory usage in Java applications. Its main features include:

  • Heap Dump Analysis
  • Leak Suspects Report
  • Retained Sizes Calculation
  • Memory Consumption Patterns

Shallow Heap: The shallow heap of an object is the amount of memory that is directly allocated for that object. 

Retained Heap: The retained heap of an object is the amount of memory that will be freed when the object is garbage collected. It includes the shallow heap of the object and the shallow heap of all objects that are reachable only from that object. 




Dominator Tree

A dominator tree is a representation of the object graph where each node (object) is dominated by its parent. An object X is said to dominate an object Y if every path from the root to Y must pass through X.
Purpose: The dominator tree helps you identify the largest chunks of retained memory and understand the keep-alive dependencies among objects.
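
The link between dominators and retained heap can be illustrated with a brute-force sketch: an object's retained set is everything that becomes unreachable from the GC roots once that object is removed. The graph, the shallow sizes, and all names below are hypothetical, and this is only the definition written as code, not a description of how MAT computes retained sizes internally.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RetainedHeapSketch {

    // Hypothetical object graph: A references B and C, D references C.
    static final Map<String, List<String>> REFS = Map.of(
            "A", List.of("B", "C"),
            "D", List.of("C"));
    // Made-up shallow sizes in bytes.
    static final Map<String, Integer> SHALLOW = Map.of("A", 16, "B", 24, "C", 32, "D", 8);
    // Objects referenced directly from GC roots.
    static final List<String> ROOTS = List.of("A", "D");

    // Objects reachable from the roots, pretending `removed` does not exist (null removes nothing).
    static Set<String> reachable(Map<String, List<String>> refs, List<String> roots, String removed) {
        Set<String> visited = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        for (String root : roots) {
            if (!root.equals(removed)) stack.push(root);
        }
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (!visited.add(current)) continue; // already processed
            for (String next : refs.getOrDefault(current, List.of())) {
                if (!next.equals(removed)) stack.push(next);
            }
        }
        return visited;
    }

    // Retained size: shallow sizes of every object that is only reachable through `target`.
    static long retainedSize(Map<String, List<String>> refs, Map<String, Integer> shallow,
                             List<String> roots, String target) {
        Set<String> withTarget = reachable(refs, roots, null);
        Set<String> withoutTarget = reachable(refs, roots, target);
        long total = 0;
        for (String obj : withTarget) {
            if (!withoutTarget.contains(obj)) total += shallow.get(obj);
        }
        return total;
    }

    public static void main(String[] args) {
        for (String obj : List.of("A", "B", "C", "D")) {
            System.out.println(obj + " retains " + retainedSize(REFS, SHALLOW, ROOTS, obj) + " bytes");
        }
    }
}
```

In this graph A retains itself and B (40 bytes) but not C, because C stays reachable through D. In dominator terms, A dominates B, while C is dominated only by the root.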


Path to GC





Eclipse MAT configuration




How to create Heap Dump


Configure the following VM arguments when running the Java application:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:/Users/xxxx/myapps/perf/heapdump -Xmx512m

Important Reports in MAT


Leak Report



Histogram Report




Dominator Report




Details on analysis can be found below

Heap Exhaustion – Memory Leak





Tuesday, 5 November 2024

Event Driven Microservice Architecture using Kafka

 

Electronic Health Record (EHR) Microservice is a robust, modern system developed using the Spring Boot framework.

It is designed with a microservices architecture pattern and employs Event-Driven Architecture (EDA) with Kafka as the message broker. EDA allows the system to respond in real time, enabling asynchronous communication between microservices through Kafka, which provides high-throughput and low-latency message delivery. 

  1. Microservice Architecture Patterns
  • API Gateway Pattern: Implemented using Spring Cloud Gateway, it acts as a single entry point for clients, routing requests to the appropriate microservices and aggregating the responses. This reduces the number of round trips between the client and the application.
  • Database per Service: Each microservice has its own dedicated database, either SQL or NoSQL. This ensures loose coupling, improves performance, and lets each service use the type of database best suited to its needs.
  • Event Sourcing Pattern: This pattern ensures all changes to application state are stored as a sequence of events. Not only can we query these events, we can also use the event log to reconstruct past states and as a foundation for automatically adjusting the state to cope with retroactive changes.
  • Command Query Responsibility Segregation (CQRS): This pattern separates read and update operations for a data store, optimizing performance, scalability, and security.
  • Distributed Tracing: Implemented using Sleuth and Zipkin, it provides visibility into the system by tracing requests across multiple services. This is crucial for debugging and latency optimization.
  • External Configuration: Using Spring Cloud Config Server and a configuration library, configuration across multiple services can be managed centrally. This is particularly useful in a microservices environment where there are many services to manage.
  • Service Discovery Pattern: Implemented using Netflix Eureka server and client, it allows microservices to find and communicate with each other without hard-coding hostnames and ports. This is crucial in dynamic environments where the number of instances of a service can autoscale.
  • Circuit Breaker Pattern: Implemented using Resilience4j, it prevents a network or service failure from cascading to other services. It does so by stopping the system from calling a failing service and returning a default response instead.
  2. Caching: Ehcache stores frequently accessed data in memory, which significantly improves system performance.
  3. Security: Implemented using Spring Security with Role-Based Access Control (RBAC), it enforces data access control at a granular level, making the system more secure.
  4. Custom Auditing Service Microservice: This service records crucial information about activity in the system. It helps in debugging and provides an audit trail, which many compliance standards require.
  5. Translation using FHIR (Fast Healthcare Interoperability Resources): FHIR is a standard for health care data exchange, published by HL7, which makes integration of EHR data smoother and more robust.
  6. Unit Test Coverage: Implemented using JUnit 5, it ensures that the codebase works as expected and helps to prevent regressions when changes are made.
  7. Behavior-Driven Development (BDD): Implemented using Cucumber, it encourages collaboration between developers, QA, and non-technical or business participants in a software project.
  8. Error and Exception Handling: Custom classes are used to handle errors and exceptions, ensuring a consistent handling strategy across the application.
  9. Continuous Integration and Continuous Deployment: Using Jenkins.
  10. Containerization: Using Docker.
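
The circuit breaker behavior listed above (stop calling a failing service and return a default response) can be sketched in a few lines. The class below is a hand-rolled illustration and deliberately does not use the Resilience4j API; the names, the threshold of two failures, and the injected clock are assumptions chosen to keep the example deterministic.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker: opens after N consecutive failures,
// then allows a trial call once the wait time has elapsed.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    boolean isOpen() {
        if (openedAt < 0) return false;
        if (clock.getAsLong() - openedAt >= openMillis) {
            // Wait time elapsed: close again so that one trial call goes through.
            openedAt = -1;
            consecutiveFailures = failureThreshold - 1; // a single further failure reopens
            return false;
        }
        return true;
    }

    <T> T call(Supplier<T> action, T fallback) {
        if (isOpen()) return fallback; // short-circuit: the service is not called at all
        try {
            T result = action.get();
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) openedAt = clock.getAsLong();
            return fallback;
        }
    }

    // Deterministic walk-through using a fake clock.
    static String demo() {
        long[] now = {0L};
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        String first = breaker.call(failing, "fallback");
        String second = breaker.call(failing, "fallback");   // second failure opens the circuit
        String third = breaker.call(() -> "ok", "fallback"); // open: healthy call is skipped
        now[0] = 1000L;                                       // wait time has elapsed
        String fourth = breaker.call(() -> "ok", "fallback"); // trial call succeeds
        return String.join(",", first, second, third, fourth);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A production library adds sliding windows, an explicit half-open state, and metrics; the sketch only shows why a failing dependency stops receiving traffic.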
















 

This meticulously designed and implemented architecture allows the EHR Microservice to be highly scalable, maintainable, efficient, and secure, making it well-suited to the complex and sensitive nature of health record management.

 

Below is the list of microservices:

 1.        Netflix Eureka Discovery Microservice:

https://github.com/manaspratimdas/starter-eda-microservice/tree/eureka-server 

2.        Spring Cloud API Gateway

 https://github.com/manaspratimdas/api-gateway/tree/master 

3.        Spring Configuration Server

https://github.com/manaspratimdas/configserver-app 

4.        Configuration Library

https://github.com/manaspratimdas/configserver 

5.        Shared Library

https://github.com/manaspratimdas/share 

6.        Audit Service

https://github.com/manaspratimdas/audit-service-ms 

7.        PatientRecordMS

https://github.com/manaspratimdas/patient-record-ms 

8.        InteroperabilityMS

https://github.com/manaspratimdas/interoperability-ms

 

 

 

 

 

 

 

 

 

 



Saturday, 22 June 2024

Event Driven Architecture


Event-Driven Architecture (EDA) is a software architecture paradigm promoting the production, detection, consumption of, and reaction to events. An event can be defined as a significant change in state, and this architecture revolves around the idea of responding to these actions or changes. EDA is widely used because it allows for high responsiveness, flexibility, and scalability in applications.







  1. Fire and Forget: User activity data is sent to a logging service; no response is required.
  2. Reliable Delivery: E-commerce applications retry until orders are processed.
  3. Infinite Stream of Events: Twitter processes an endless stream of tweets.
  4. Anomaly Detection/Pattern Recognition: Real-time fraud detection in credit card transactions.
  5. Broadcasting: Stock price changes are broadcast to subscribed traders.
  6. Buffering: Netflix loads video chunks for uninterrupted playback.







 

 

Event Streaming

  • Description: This pattern involves a continuous, real-time flow of events that are produced and immediately processed. No explicit subscription is needed, as all events are available to be processed.
  • Example: A music streaming app like Spotify uses event streaming. When a user plays a song, the data (song) is streamed or sent continuously, allowing the user to listen to it in real time.

Publisher/Subscriber

  • Description: This pattern involves publishers producing events and subscribers actively choosing which events they want to listen and react to. It decouples the event source from its consumers.
  • Example: A blog website can use this pattern. When a blogger (publisher) posts a new blog, only the users (subscribers) who have subscribed to that particular blogger or topic will be notified.
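
The publisher/subscriber example can be sketched as a tiny in-memory bus. This is an illustration only: the class name, the topic names, and the subscriber are hypothetical, and a real system would place a broker such as Kafka between publishers and subscribers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory publish/subscribe bus keyed by topic.
class InMemoryPubSub {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // A subscriber chooses which topic it wants to hear about.
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    // The publisher does not know who, if anyone, is listening.
    void publish(String topic, String event) {
        for (Consumer<String> handler : subscribers.getOrDefault(topic, List.of())) {
            handler.accept(event);
        }
    }

    static List<String> demo() {
        InMemoryPubSub bus = new InMemoryPubSub();
        List<String> inbox = new ArrayList<>();
        bus.subscribe("java", post -> inbox.add("alice got: " + post));
        bus.publish("java", "JVM tuning notes");
        bus.publish("cooking", "Pasta recipe"); // nobody subscribed, so nobody is notified
        return inbox;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```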

 

 


Saga Design Pattern


  • The Saga design pattern is used to manage transactions that span multiple services in a microservices architecture.
  • It maintains data consistency across services with a series of local transactions.
  • Example: In an e-commerce app, placing an order may involve multiple services such as inventory, payment, and shipping. The Saga pattern ensures that either all these operations succeed, or, if one fails, appropriate compensating transactions are executed.
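
The e-commerce example can be written out as a minimal sketch in which each step either succeeds or fails, and a failure triggers compensation of the completed steps in reverse order. The step names and the boolean standing in for a real service call are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class OrderSagaSketch {

    // One local transaction; `succeeds` is a stand-in for the outcome of a real service call.
    static final class Step {
        final String name;
        final boolean succeeds;

        Step(String name, boolean succeeds) {
            this.name = name;
            this.succeeds = succeeds;
        }
    }

    // Runs the steps in order; on the first failure, compensates completed steps newest first.
    static List<String> run(List<Step> steps) {
        List<String> log = new ArrayList<>();
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            if (step.succeeds) {
                log.add("done: " + step.name);
                completed.push(step);
            } else {
                log.add("failed: " + step.name);
                while (!completed.isEmpty()) {
                    log.add("compensate: " + completed.pop().name);
                }
                return log;
            }
        }
        return log;
    }

    static List<String> demo() {
        return run(List.of(
                new Step("reserve inventory", true),
                new Step("charge payment", true),
                new Step("schedule shipping", false)));
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Here shipping fails, so the payment is refunded first and the inventory reservation is released last, leaving the services consistent without a distributed transaction.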

  

 

Orchestration

  • Description: Orchestration involves a central controller (orchestrator) that dictates how the services will interact with one another.
  • Key Point: Provides centralized control and is simple to understand, but can create a bottleneck.

Message Broker

  • Description: A message broker pattern uses a communication platform for services to send messages between each other.
  • Key Point: Decouples services and allows asynchronous communication, but adds complexity and requires careful handling of message delivery.







Command Query Responsibility Segregation (CQRS)

This pattern separates read and write operations for improved performance and scalability.

Use Case 1: Separate Read and Write

In an e-commerce system, frequent read operations such as browsing products can be optimized separately from less frequent write operations such as inventory updates.

Use Case 2: Materialized View

Data from multiple microservices is combined to create a joined view.




 


Event Sourcing


  • Event Sourcing is a design pattern that ensures all changes to application state are stored as a sequence of events. These events can be queried, and the state can be reconstructed as of any point in time.
  • It is useful in systems where a complete history of actions is necessary, as in banking, where all transactions are recorded and can be replayed to derive the current balance.
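
The banking example can be sketched as an append-only list of events whose replay yields the balance at any point in the history. The class name, event types, and amounts are hypothetical and chosen only to illustrate the idea.

```java
import java.util.List;

class AccountEventSourcing {

    enum Type { DEPOSITED, WITHDRAWN }

    // An immutable fact about something that happened to the account.
    static final class Event {
        final Type type;
        final long amountCents;

        Event(Type type, long amountCents) {
            this.type = type;
            this.amountCents = amountCents;
        }
    }

    // The balance is never stored; it is derived by replaying the first `eventCount` events.
    static long balanceAfter(List<Event> log, int eventCount) {
        long balance = 0;
        for (Event e : log.subList(0, eventCount)) {
            balance += (e.type == Type.DEPOSITED) ? e.amountCents : -e.amountCents;
        }
        return balance;
    }

    // Made-up history used for the walk-through.
    static List<Event> sampleLog() {
        return List.of(
                new Event(Type.DEPOSITED, 10_000),
                new Event(Type.WITHDRAWN, 2_500),
                new Event(Type.DEPOSITED, 1_000));
    }

    public static void main(String[] args) {
        List<Event> log = sampleLog();
        for (int i = 0; i <= log.size(); i++) {
            System.out.println("balance after " + i + " events: " + balanceAfter(log, i));
        }
    }
}
```

Replaying only a prefix of the log is what makes past states reconstructable: the same history answers both "what is the balance now" and "what was it two events ago".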



Contract Testing

  • Contract testing verifies microservice interactions against expected requests and responses.
  • It makes issue detection easier without full integration testing.
  • It allows independent development and scaling of microservices.
  • Pact is a tool that can be used for contract testing.












Friday, 21 June 2024

Domain Driven Design

 


Architectural Patterns  :  Reusable Solution Pattern

1. Communication Patterns

   - SOA: Services provided through a protocol, like separate services in banking systems.

   - Message Bus Architecture: Components communicate through a common channel, e.g., services in an e-commerce platform. 

2. Structural Patterns

   - Layered Architecture: Separates concerns into layers, like distinct layers in a mobile app.

   - OOAD: Uses visualization and abstraction for system models, e.g., modeling in video game design. 

3. Deployment Patterns

   - Client-Server: Server provides services to clients, like a web server serving browsers.

   - 3-Tier Architecture: Separates application into data, logic, and presentation tiers, e.g., an e-commerce site. 

4. Data Patterns

   - Data-Centric Design: Focuses on data and its transformation, like in a data analytics system.

   - DFD: Graphical representation of data flow, e.g., visualizing data flow in a retail management system. 

5. Domain Patterns

   - Domain-Driven Design: Focuses on the core domain and logic, such as patient care in a healthcare system.


Domain-Driven Design : Focus is on the business domain






  • Supporting: Auxiliary functionality, like an authentication module in software applications.
  • Core: Main business logic and functionality that differentiates the business, like a product catalogue.
  • Generic: Necessary but not business-specific functionality, like a logging module in a system.






Bounded Context

In Domain-Driven Design, a Bounded Context is like a fenced garden where specific models exist. Imagine a university with departments like Administration, Academics, and Sports. Each department, with its own rules and processes, represents a Bounded Context. The term "registration" might mean course enrolment in Academics, but signifies sports event sign-up in Sports.

In software, Bounded Contexts maintain model integrity within each area, reducing confusion and enhancing system design clarity.






Strategic Pattern

 
  • Big Ball of Mud: A system with no clear structure, like a patched, extended legacy system.
  • Anti-Corruption Layers: Protects your system from poorly designed components, acting as a translator, e.g., when integrating with an old billing system
  • Separate Ways: Developing independent systems when shared models aren't beneficial, like separate systems for production and HR in a company.
  • Open Host Services: Systems expose their functionality in a technology-agnostic manner, e.g., a bookstore's inventory system exposing data through a RESTful API.
  • Conformist: Conforming to an existing model when maintaining a separate one isn't practical, e.g., a new microservice adopting the data models of an existing system.
  • Customer/Supplier Team: One team acts as a customer and another as a supplier, like the backend team providing APIs and the frontend team using them.
  • Shared Kernel: A common model shared between different systems, like two microservices sharing a product catalogue model.

Tactical Pattern

·  Entity, Value, Aggregate Object: Entities are unique, values are immutable, and aggregates are associated objects treated as a unit. In an e-commerce app, a User is an entity, their Address is a value, and a User with their Orders is an aggregate. 

·   Repository Pattern: This pattern mediates between the domain and data mapping layers like an in-memory object collection. In a book store app, a repository handles data operations for books. 

·  Domain, Application, Infrastructure Services: Domain services encapsulate business logic; application services delegate work; infrastructure services communicate with external systems. In a banking app, interest calculation is a domain service, money transfer is an application service, and email notification is an infrastructure service.

·   Anemic and Rich Model: An anemic model separates data and behavior, while a rich model encapsulates them together. In an anemic model, an Order object holds only data and a separate service handles the logic. In a rich model, the Order object contains both the data and the methods that manipulate it.
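
A rich model can be sketched as an Order that guards its own invariants instead of exposing setters to an external service. The class name, the rules, and the prices below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rich domain object: data and the rules that protect it live together.
class RichOrder {
    private final List<Long> linePricesCents = new ArrayList<>();
    private boolean submitted = false;

    void addLine(long priceCents) {
        if (submitted) throw new IllegalStateException("order already submitted");
        if (priceCents <= 0) throw new IllegalArgumentException("price must be positive");
        linePricesCents.add(priceCents);
    }

    void submit() {
        if (linePricesCents.isEmpty()) throw new IllegalStateException("cannot submit an empty order");
        submitted = true;
    }

    boolean isSubmitted() {
        return submitted;
    }

    long totalCents() {
        long total = 0;
        for (long price : linePricesCents) total += price;
        return total;
    }

    // Builds a small order and returns its total.
    static long demo() {
        RichOrder order = new RichOrder();
        order.addLine(1_500);
        order.addLine(2_500);
        order.submit();
        return order.totalCents();
    }

    public static void main(String[] args) {
        System.out.println("order total (cents): " + demo());
    }
}
```

In the anemic variant, the same checks would sit in a separate service class operating on a plain data holder, so nothing would stop other code from putting the Order into an invalid state.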










Streaming with Kafka API

The Kafka Streams API is a Java library for building real-time applications and microservices that efficiently process and analyze large-sca...