Wednesday, 15 January 2025

Streaming with Kafka API

The Kafka Streams API is a Java library for building real-time applications and microservices that efficiently process and analyze large-scale data streams using Kafka's distributed architecture.

Key Features

  • Scalability and Fault Tolerance: Highly scalable and fault-tolerant, ensuring reliable data processing.
  • Stateful and Stateless Processing: Supports both stateful operations (e.g., aggregations and joins) and stateless operations (e.g., filtering and mapping).
  • Event-Time Processing: Handles event-time processing, crucial for applications needing to process events based on their occurrence time.
  • Integration with Kafka: Seamlessly integrates with Kafka, allowing consumption from and production to Kafka topics.

Use Cases

  • Real-Time Analytics: Analyze data in real-time for insights and decision-making.
  • Monitoring and Alerting: Monitor systems and trigger alerts based on real-time data.
  • Data Transformation: Transform and enrich data streams before storing or further processing.




Topology

A topology in Kafka Streams defines the computational logic of your application as a directed graph of processors and streams, specifying how input data is transformed into output data.
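
To make this concrete, here is a minimal sketch of a topology in Java; the topic names ("orders", "confirmed-orders") and the filtering rule are assumptions for illustration:

   import java.util.Properties;
   import org.apache.kafka.common.serialization.Serdes;
   import org.apache.kafka.streams.KafkaStreams;
   import org.apache.kafka.streams.StreamsBuilder;
   import org.apache.kafka.streams.StreamsConfig;
   import org.apache.kafka.streams.kstream.KStream;

   public class TopologyExample {
       public static void main(String[] args) {
           StreamsBuilder builder = new StreamsBuilder();

           // Source -> processors -> sink: read "orders", keep confirmed ones, write out.
           KStream<String, String> orders = builder.stream("orders");
           orders.filter((key, value) -> value != null && value.contains("CONFIRMED"))
                 .mapValues(value -> value.toUpperCase())
                 .to("confirmed-orders");

           Properties props = new Properties();
           props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-demo");
           props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
           props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
           props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

           KafkaStreams streams = new KafkaStreams(builder.build(), props);
           streams.start();
           Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
       }
   }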


Processor


A processor in Kafka Streams is a node within the topology that represents a single processing step. Processors can be either stateless or stateful.

Processors use the Processor API to implement custom logic with methods like process(), init(), and close() for handling records, initialization, and cleanup. State is managed using state stores.
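
As a hedged sketch, a stateful processor that counts records per key might look like this; the store name "counts-store" is an assumption, not part of the original post:

   import org.apache.kafka.streams.processor.api.Processor;
   import org.apache.kafka.streams.processor.api.ProcessorContext;
   import org.apache.kafka.streams.processor.api.Record;
   import org.apache.kafka.streams.state.KeyValueStore;

   // Stateful processor: counts how many records arrive per key.
   public class CountingProcessor implements Processor<String, String, String, Long> {
       private ProcessorContext<String, Long> context;
       private KeyValueStore<String, Long> store;

       @Override
       public void init(ProcessorContext<String, Long> context) {
           this.context = context;
           this.store = context.getStateStore("counts-store"); // store name is an assumption
       }

       @Override
       public void process(Record<String, String> record) {
           Long current = store.get(record.key());
           long updated = (current == null) ? 1L : current + 1L;
           store.put(record.key(), updated);
           context.forward(new Record<>(record.key(), updated, record.timestamp()));
       }

       @Override
       public void close() {
           // Nothing to clean up; state stores are closed by the runtime.
       }
   }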

State stores


In Kafka Streams, state stores are local, queryable storage engines that maintain state information for stateful operations, ensuring durability, fault tolerance, and data consistency.
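
A state store is typically declared with a StoreBuilder and attached to the topology. A minimal sketch, again assuming the hypothetical "counts-store" used by the processor above:

   import org.apache.kafka.common.serialization.Serdes;
   import org.apache.kafka.streams.state.KeyValueStore;
   import org.apache.kafka.streams.state.StoreBuilder;
   import org.apache.kafka.streams.state.Stores;

   // Persistent (RocksDB-backed) key-value store, changelogged to Kafka for fault tolerance.
   StoreBuilder<KeyValueStore<String, Long>> countsStore =
       Stores.keyValueStoreBuilder(
           Stores.persistentKeyValueStore("counts-store"),
           Serdes.String(),
           Serdes.Long());

   // Wire it into the topology and hand it to the processor:
   // builder.addStateStore(countsStore);
   // stream.process(CountingProcessor::new, "counts-store");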

KStream

A KStream is an abstraction in Kafka Streams representing a continuous stream of records, where each record is a key-value pair. It is used for processing and transforming data in real-time.

  • KStream provides access to all records in a Kafka topic, treating each event independently.
  • Each event is processed by the entire topology and is immediately available to the KStream.
  • KStream, also known as a record stream or log, represents an infinite stream of records, similar to inserts in a database table.

Key Features of KStream

  • Record Stream: Represents an unbounded, continuous flow of records.
  • Transformations & Joins: Supports filtering, mapping, flat-mapping, and joining with KStreams, KTables, and GlobalKTables.
  • Aggregations & Fault Tolerance: Allows for aggregations, windowed operations, and ensures fault tolerance through Kafka's architecture.


How Kafka Streams Executes the Topology

Kafka Streams splits the topology into tasks, one per input topic partition. Tasks are assigned to stream threads, which may be spread across multiple application instances; this partition-based parallelism is what provides the library's scalability and fault tolerance.

Serialization / Deserialization


Serdes is the factory class in Kafka Streams that handles serialization and deserialization (serde = serializer/deserializer) of record keys and values.
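
For example, serdes can be supplied explicitly when reading a topic, overriding the defaults set in StreamsConfig; the topic name "metrics" is an assumption:

   import org.apache.kafka.common.serialization.Serdes;
   import org.apache.kafka.streams.StreamsBuilder;
   import org.apache.kafka.streams.kstream.Consumed;
   import org.apache.kafka.streams.kstream.KStream;

   StreamsBuilder builder = new StreamsBuilder();

   // Explicit serdes for a topic with string keys and long values.
   KStream<String, Long> metrics = builder.stream(
       "metrics",
       Consumed.with(Serdes.String(), Serdes.Long()));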


Error Handlers in Kafka Streams

Kafka Streams exposes pluggable handlers for the three main failure points: a DeserializationExceptionHandler for records that cannot be deserialized (e.g., LogAndContinueExceptionHandler or LogAndFailExceptionHandler), a ProductionExceptionHandler for failures while writing to Kafka, and a StreamsUncaughtExceptionHandler for unexpected errors in processing threads.
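
A minimal configuration sketch for the deserialization handler:

   import org.apache.kafka.streams.StreamsConfig;
   import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

   // Skip (and log) records that cannot be deserialized instead of crashing the app.
   props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
             LogAndContinueExceptionHandler.class);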

KTable

KTable is an abstraction in Kafka Streams that holds the latest value for a given key.

  • Update-Stream/Changelog: A KTable models a changelog, keeping only the latest value for each key as new records arrive.
  • Key-Based Updates: Records with the same key update the previous value; records with a null key are ignored.
  • Relational DB Analogy: Similar to an update operation on a table row for a given primary key (see the sketch below).
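
A short sketch of the difference in code; the topic name "user-profiles" is an assumption:

   import org.apache.kafka.streams.StreamsBuilder;
   import org.apache.kafka.streams.kstream.KStream;
   import org.apache.kafka.streams.kstream.KTable;

   StreamsBuilder builder = new StreamsBuilder();

   // Read the same hypothetical topic two ways (pick one per application;
   // a topology cannot register the same topic as a source twice):

   // As a KStream: every profile-change event is an independent record.
   KStream<String, String> profileEvents = builder.stream("user-profiles");

   // As a KTable: only the latest value per user key is retained.
   KTable<String, String> latestProfiles = builder.table("user-profiles");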




Friday, 10 January 2025

Request for Proposal (RFP) for Software Development


Purpose of RFPs:

    • Engage suitable vendors by eliminating ambiguities in requirements and deliverables.
    • Essential for companies lacking expertise in creating such documents.

 

What is an RFP for Software Development?

    • An initial document outlining the software project, inviting vendors to submit proposals.
    • Ensures project transparency and mitigates risks of contracting unsuitable providers.
    • Simplifies establishing business partnerships and aligning expectations.

 

Who Should Write an RFP?

    • Involve relevant members like product owners, project managers, business analysts, and engineers.
    • In small startups, CEOs or co-founders may create RFPs; in larger organizations, executive managers or procurement professionals handle this.

 

Benefits of RFPs for Software Development:

    • Select the right software development company.
    • Clarify project requirements.
    • Save time by minimizing repetitive questions.
    • Ensure transparent contracts.

 

Steps in the RFP Process:

    • Step 1: Executive Summary: Include project overview, company description, and project goals.
    • Step 2: Project Scope: Detail project management, infrastructure, functional, quality assurance, and platform requirements.
    • Step 3: Timeline: Set priorities and deadlines.
    • Step 4: Pricing Model and Budget: Outline costs and budget considerations.
    • Step 5: Vendor Bids and Selection Criteria: Establish evaluation criteria and streamline bid comparison.

 



Tips for Effective RFPs:

    • Keep RFPs clear and concise.
    • Prioritize value over low prices.
    • Present pain points rather than prescribed solutions.
    • Include a list of required features.
    • Preselect 3-5 companies to avoid information overload.

Request for Proposal (RFP) for Hotel Management System Modernization - A Case Study


Hotel MPD International, a well-established hotel with over 10 years in the industry, seeks to modernize its hotel management system. The current system is a monolithic application deployed on-premises using legacy technology. The goal is to transition to a modern, cloud-based system utilizing the latest technology to enhance efficiency, scalability, and guest experience.

Details here: https://github.com/manaspratimdas/hms/blob/main/RFP/01_rfp_modernization.md


Wednesday, 8 January 2025

Minimizing Toil & Wastage in Software Development

 

Toil

  • Definition: Toil refers to tasks that are manual, repetitive, automatable, tactical, and devoid of enduring value. These tasks scale linearly with the growth of the service and do not contribute to long-term improvements.
  • Examples: Routine maintenance, manual deployments, repetitive testing, and handling alerts manually.

Wastage

  • Definition: Wastage involves any activity that does not add value to the end product or service. This includes inefficient processes, unnecessary steps, and delays.
  • Examples: Waiting for approvals, redundant meetings, and excessive debugging due to poor code quality.





Smart CI-CD Pipeline

Continuous Integration (CI)
  • Linting: Analyzes source code to flag errors and stylistic issues, ensuring code quality and consistency.
  • Unit Test Coverage: Verifies individual components work as intended, detecting issues early.
  • Static Application Security Testing (SAST): Analyzes code for security vulnerabilities without execution, identifying security issues early.
  • Build: Compiles source code into executable artifacts, ensuring code is ready for testing and deployment.
  • Publish: Stores built artifacts in a repository, centralizing management and distribution of artifacts.
Continuous Delivery (CD)
  • Deploy to Non-Prod Environments: Deploys application to various environments iteratively, ensuring thorough testing in production-like environments.
  • Dynamic Application Security Testing (DAST): Identifies security vulnerabilities in web applications by simulating real-world attacks, providing a realistic assessment of security.
  • Regression Testing: Ensures new code changes do not adversely affect existing functionality, maintaining software stability.
  • Performance Testing: Evaluates the speed, responsiveness, and stability of a system under a given workload, enhancing user satisfaction.





Left Shift Strategy

  • Purpose: Integrates testing and quality assurance early in the development lifecycle.
  • Benefits: Early defect detection, improved collaboration, enhanced quality.
  • Examples: Continuous Integration (CI), Test-Driven Development (TDD), Static Code Analysis, Pair Programming, Automated Unit Testing, and optimizing the inner development loop.
Security left shift example







Automate the Code Review

  • Purpose: Maintain code quality and consistency using tools and AI.
  • Benefits:
    • Time Savings: Automates routine checks, freeing up developers' time.
    • Consistency: Ensures consistent enforcement of coding standards.
    • Improved Quality: Identifies potential issues and optimizations early.
  • Bitbucket / GitHub
    • Pull Request Templates: Standardize PRs with necessary information.
    • Static Code Analysis Tools: Use SonarQube or CodeClimate for quality and security checks.
    • Automate Reviewer Assignment: Assign reviewers automatically based on code changes.
    • Merge Checks: Enforce quality requirements before merging PRs.

Bulk Build & Deployment 


Bulk Build & Deployment is utilized in Release Management by some organizations to streamline the release process, where different teams build and deploy a large number of applications on a specific day of the sprint.

Challenges : Coordination Complexity, Build Quality, Resource Allocation, Visibility and Tracking, Error Handling, Environment Consistency, Automated Testing, Rollback Procedures.

Automate Creation of the Branch for Bulk Apps for Every Sprint
  • Reduces manual effort and ensures consistency.
  • Speeds up the initial setup process.

Automate PR Creation
  • Streamlines code integration and reduces administrative tasks.
  • Ensures standardized and error-free PRs.

Automate Merging of PR
  • Speeds up the integration process and reduces repetitive tasks.
  • Ensures consistent and error-free merges.

Scheduled Automate Build and Deployment in Non-Production
  • Ensures regular testing and deployment, catching issues early.
  • Reduces manual effort and provides continuous feedback.

Bulk Deletion of Unused Branches After the Release
  • Keeps the repository clean and organized.
  • Frees up resources and reduces manual cleanup tasks.

Monitoring and Alerting


Purpose: Track the performance and health of applications.

Benefits:
  • Proactive Issue Resolution: Detects and resolves issues before they impact users.
  • Improved Reliability: Ensures the system remains reliable and available.
  • Data-Driven Decisions: Provides insights for informed system improvements.

Scripting and scheduling

  • Cleanup Script: Automates removal of temporary files and old logs to free up disk space.
  • Space Check Script: Monitors disk space usage and alerts when thresholds are exceeded.
  • Installation Script: Automates software installation and configuration for consistency and efficiency.
  • Log Management and Rotation: Manages and rotates log files to prevent excessive disk space usage.
  • New Repository Creation Script with Predefined Rules: Automates creation of new repositories with predefined structures and rules.









Saturday, 4 January 2025

GenAI Fundamentals with LangChain

LangChain 

LangChain is a framework for developing applications powered by large language models (LLMs).

LLM: A machine learning model that can comprehend and generate human-language text by learning from massive language datasets.


Prompts & Prompt Chaining

Prompt: An input that a user provides to an AI model to get a specific response.

PromptTemplate

A PromptTemplate is a reusable prompt with named placeholders that are filled in at runtime.


Prompt Chaining: Sequential prompts enhance model coherence and structure; the output of one prompt becomes part of the next prompt's input.


Sequential Chain
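
The actual implementation uses LangChain in Python (see the linked repo below). As a language-neutral illustration, here is a minimal plain-Java sketch of a sequential chain, where callModel is a hypothetical stand-in for any LLM call, not a specific library's API:

   // Sequential chain: each step's output feeds the next step's prompt.
   public class SequentialChain {

       // Hypothetical stand-in for a real LLM call (e.g., an HTTP request
       // to a model provider); returns a canned string for illustration.
       static String callModel(String prompt) {
           return "<model response to: " + prompt + ">";
       }

       public static void main(String[] args) {
           // Step 1: generate an outline.
           String outline = callModel(
               "Create an outline for a blog post about Kafka Streams.");

           // Step 2: expand the outline into a draft, chaining the output.
           String draft = callModel(
               "Write an introduction based on this outline:\n" + outline);

           System.out.println(draft);
       }
   }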




Code snippet and test generation tool




Chatbot – Fundamentals


Chat Memory:
  • Feature in chatbot systems.
  • Remembers past interactions and context.
  • Enables personalized responses, as sketched below.
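
A minimal sketch of the idea in plain Java; the class and method names are illustrative, not a specific library's API:

   import java.util.ArrayList;
   import java.util.List;

   // Chat memory: remember past turns and replay them with each new request
   // so the model can answer with conversation context.
   public class ChatMemory {
       private final List<String> turns = new ArrayList<>();

       public void record(String role, String message) {
           turns.add(role + ": " + message);
       }

       // Build the full prompt: prior turns first, then the new user message.
       public String withContext(String newUserMessage) {
           return String.join("\n", turns) + "\nuser: " + newUserMessage;
       }
   }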


AI-powered chat functionalities





Retrieval-Augmented Generation


RAG is a technique for augmenting LLM knowledge with additional data.

RAG Architecture:
  • Indexing: Data ingestion and indexing pipeline from a source, typically done offline.
  • Retrieval and generation: The RAG chain retrieves relevant data from the index based on the user's query and passes it to the model at runtime, as sketched below.
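
The heart of the retrieval step is ranking indexed chunks by similarity between embeddings. A hedged plain-Java sketch with toy vectors (real embeddings would come from an embedding model and have hundreds of dimensions):

   import java.util.Comparator;
   import java.util.Map;

   // Retrieval step of RAG: rank indexed chunks by cosine similarity
   // between the query embedding and each chunk embedding.
   public class Retriever {

       static double cosine(double[] a, double[] b) {
           double dot = 0, normA = 0, normB = 0;
           for (int i = 0; i < a.length; i++) {
               dot += a[i] * b[i];
               normA += a[i] * a[i];
               normB += b[i] * b[i];
           }
           return dot / (Math.sqrt(normA) * Math.sqrt(normB));
       }

       public static void main(String[] args) {
           // Toy index: chunk text -> embedding vector.
           Map<String, double[]> index = Map.of(
               "Check-in starts at 2 PM.", new double[]{0.9, 0.1, 0.0},
               "The pool is open until 10 PM.", new double[]{0.1, 0.8, 0.2});

           double[] queryEmbedding = {0.85, 0.15, 0.05}; // e.g., "When can I check in?"

           // Pick the most similar chunk to pass to the model as context.
           String bestChunk = index.entrySet().stream()
               .max(Comparator.comparingDouble(e -> cosine(queryEmbedding, e.getValue())))
               .map(Map.Entry::getKey)
               .orElseThrow();

           System.out.println("Context for the LLM: " + bestChunk);
       }
   }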



Embedding Generation




Contextual Question Handling and Retrieval-Based QA System



https://github.com/manaspratimdas/GenAIwithPy/tree/main/04-question-answer-module/042-pdf



LangChain Agents


  • Agents use LLMs as reasoning engines for decision-making.
  • They execute actions based on the LLM outputs.
  • Results from actions can influence further decision-making by the LLM, as in the loop sketched below.
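
A hedged sketch of that loop in plain Java; the model call and the tool are stubs, not a specific agent framework:

   // Minimal agent loop: the LLM picks an action, we execute it, and the
   // observation is fed back into the next reasoning step.
   public class AgentLoop {

       static String callModel(String context) {               // stub LLM call
           return context.contains("Observation") ? "FINISH: 42" : "ACTION: calculator 6*7";
       }

       static String runTool(String action) {                  // stub tool
           return action.contains("6*7") ? "42" : "unknown";
       }

       public static void main(String[] args) {
           String context = "Question: What is 6 times 7?";
           for (int step = 0; step < 5; step++) {               // cap the loop
               String decision = callModel(context);
               if (decision.startsWith("FINISH")) {
                   System.out.println("Answer: " + decision.substring(8));
                   return;
               }
               String observation = runTool(decision);
               context += "\n" + decision + "\nObservation: " + observation;
           }
       }
   }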




Setup:

  • OpenAI Account Setup
  • Install Python 3.11.0 
  • pip3 install pipenv 
  • pipenv install
  • pipenv shell












Monday, 30 December 2024

Database Profiling

Database profiling is the process of examining the data in a database to understand its structure, content, quality, and relationships, typically by collecting statistics at the column, cross-column, and cross-table level.

SQL Column Profiling


1. Count of NULL values:   

   SELECT COUNT(*) - COUNT(column_name) AS null_count
   FROM table_name;  

2. Distinct values and their counts:   

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name;  

3. Minimum and maximum values:   

   SELECT MIN(column_name) AS min_value, MAX(column_name) AS max_value
   FROM table_name;  

4. Average and standard deviation:   

   SELECT AVG(column_name) AS avg_value, STDDEV(column_name) AS stddev_value
   FROM table_name;  

5. String length distribution:   

   SELECT LENGTH(column_name) AS length, COUNT(*)
   FROM table_name
   GROUP BY LENGTH(column_name);


SQL Cross-Column Profiling


1. Correlation between two numeric columns:   

   SELECT CORR(column1, column2) AS correlation
   FROM table_name;  

2. Finding unique combinations of values across columns:   

   SELECT column1, column2, COUNT(*)
   FROM table_name
   GROUP BY column1, column2; 

3. Detecting functional dependencies (values of column1 that determine exactly one column2 value):

   SELECT column1, COUNT(DISTINCT column2) AS unique_values
   FROM table_name
   GROUP BY column1
   HAVING COUNT(DISTINCT column2) = 1;

4. Checking for null values across multiple columns:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column1 IS NULL OR column2 IS NULL; 


SQL Cross-Table Profiling

1. Foreign Key Relationships:   

   SELECT 
       tc.table_schema, 
       tc.table_name, 
       kcu.column_name, 
       ccu.table_schema AS foreign_table_schema,
       ccu.table_name AS foreign_table_name,
       ccu.column_name AS foreign_column_name 
   FROM 
       information_schema.table_constraints AS tc 
       JOIN information_schema.key_column_usage AS kcu
         ON tc.constraint_name = kcu.constraint_name
         AND tc.table_schema = kcu.table_schema
       JOIN information_schema.constraint_column_usage AS ccu
         ON ccu.constraint_name = tc.constraint_name
         AND ccu.table_schema = tc.table_schema
   WHERE tc.constraint_type = 'FOREIGN KEY';   

2. Join Analysis:   

   SELECT 
       t1.column1, t2.column2, COUNT(*)
   FROM 
       table1 t1
       JOIN table2 t2 ON t1.common_column = t2.common_column
   GROUP BY 
       t1.column1, t2.column2;   

3. Referential Integrity Checks:   

   SELECT 
       t1.common_column
   FROM 
       table1 t1
   LEFT JOIN 
       table2 t2 ON t1.common_column = t2.common_column
   WHERE 
       t2.common_column IS NULL;


SQL Data Rule Validation Profiling


1. Check for non-null values:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column_name IS NULL;  

2. Check for unique values:  

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1;  

3. Check for valid ranges:   

   SELECT *
   FROM table_name
   WHERE column_name < min_value OR column_name > max_value;   

4. Check for specific patterns (e.g., email format):   

   SELECT *
   FROM table_name
   WHERE column_name NOT LIKE '%_@__%.__%';   

5. Check for foreign key constraints:   

   SELECT t1.*
   FROM table1 t1
   LEFT JOIN table2 t2 ON t1.foreign_key = t2.primary_key
   WHERE t2.primary_key IS NULL;

SQL Cardinality

1. Count Distinct Values:  

   SELECT COUNT(DISTINCT column_name) AS cardinality
   FROM table_name;   

2. High Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   ORDER BY frequency DESC;  

3. Low Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1
   ORDER BY frequency DESC;


Friday, 27 December 2024

Data Governance

 

Data governance is a set of processes, policies, and standards that ensure data is secure, accurate, and usable.


Data Governance Framework

The DAMA-DMBOK (Data Management Body of Knowledge) is a comprehensive framework developed by DAMA International to guide organizations in managing data as a strategic asset. It provides best practices and structured approaches across various aspects of data management, ensuring data quality, accessibility, and compliance.



Case Study- Hospitality Domain






Data Inventory and Classification




Data Quality Management





Data Security and Privacy





Data Integration and Interoperability






Data Access and Usage








Tuesday, 24 December 2024

Domain-Driven Design for Hotel Management System

Core Domain

The core domain represents the most critical and unique aspects of the hotel management system that provide competitive advantage.


Booking Management

  • Bounded Context: Handles room availability, reservations, and cancellations.
  • Ubiquitous Language: Booking, Reservation, Availability, Cancellation, Check-in, Check-out (see the sketch below).
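
As an illustrative sketch, the ubiquitous language of this context can surface directly in the domain model; all names and rules below are assumptions for illustration, not the system's actual design:

   import java.time.LocalDate;

   // Aggregate root for the Booking Management bounded context; field and
   // method names deliberately mirror the ubiquitous language above.
   public class Reservation {
       public enum Status { CONFIRMED, CHECKED_IN, CHECKED_OUT, CANCELLED }

       private final String reservationId;
       private final String roomId;
       private final LocalDate checkIn;
       private final LocalDate checkOut;
       private Status status;

       public Reservation(String reservationId, String roomId,
                          LocalDate checkIn, LocalDate checkOut) {
           // Domain invariant: a stay must span at least one night.
           if (!checkOut.isAfter(checkIn)) {
               throw new IllegalArgumentException("Check-out must be after check-in");
           }
           this.reservationId = reservationId;
           this.roomId = roomId;
           this.checkIn = checkIn;
           this.checkOut = checkOut;
           this.status = Status.CONFIRMED;
       }

       // Cancellation is a domain operation with rules, not a row delete.
       public void cancel() {
           if (status != Status.CONFIRMED) {
               throw new IllegalStateException("Only a confirmed reservation can be cancelled");
           }
           status = Status.CANCELLED;
       }

       public Status status() { return status; }
   }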


Guest Management

  • Bounded Context: Manages guest profiles, preferences, and loyalty programs.
  • Ubiquitous Language: Guest Profile, Loyalty Points, Preferences, Membership, Rewards.


Payment Processing

  • Bounded Context: Manages payment transactions, billing, and refunds.
  • Ubiquitous Language: Payment, Billing, Invoice, Refund, Transaction, Payment Gateway.


Supporting Domain

The supporting domain includes functionalities that are important but not unique to the hotel management system.

Customer Support

  • Bounded Context: Handles guest inquiries, complaints, and support tickets.
  • Ubiquitous Language: Support Ticket, Inquiry, Complaint, Resolution, Live Chat, Help Desk.


Housekeeping Management

  • Bounded Context: Manages housekeeping schedules, tasks, and inventory.
  • Ubiquitous Language: Housekeeping Schedule, Task, Inventory, Cleaning, Maintenance.


Event Management

  • Bounded Context: Manages event bookings, scheduling, and coordination.
  • Ubiquitous Language: Event Booking, Schedule, Coordination, Venue, Catering.



Generic Domain

The generic domain includes functionalities that are common across many systems and can be outsourced or reused.


Authentication and Authorization

  • Bounded Context: Manages user authentication, roles, and permissions.
  • Ubiquitous Language: User, Role, Permission, Authentication, Authorization, Login, Access Control.


Reporting and Analytics

  • Bounded Context: Generates reports and provides analytics on system usage and performance.
  • Ubiquitous Language: Report, Analytics, Dashboard, Metrics, KPI, Data Visualization.


Notification Service

  • Bounded Context: Manages sending notifications via email, SMS, and push notifications.
  • Ubiquitous Language: Notification, Email, SMS, Push Notification, Alert, Message.

Documentation for Domain-Driven Design for Hotel Management System
