Wednesday, 15 January 2025

Streaming with Kafka API

The Kafka Streams API is a Java library for building real-time applications and microservices that efficiently process and analyze large-scale data streams using Kafka's distributed architecture.

Key Features

  • Scalability and Fault Tolerance: Highly scalable and fault-tolerant, ensuring reliable data processing.
  • Stateful and Stateless Processing: Supports both stateful operations (e.g., aggregations and joins) and stateless operations (e.g., filtering and mapping).
  • Event-Time Processing: Handles event-time processing, crucial for applications needing to process events based on their occurrence time.
  • Integration with Kafka: Seamlessly integrates with Kafka, allowing consumption from and production to Kafka topics.

Use Cases

  • Real-Time Analytics: Analyze data in real-time for insights and decision-making.
  • Monitoring and Alerting: Monitor systems and trigger alerts based on real-time data.
  • Data Transformation: Transform and enrich data streams before storing or further processing.




Topology

A topology in Kafka Streams defines the computational logic of your application as a directed graph of processors and streams, specifying how input data is transformed into output data.
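
To make this concrete, here is a minimal sketch of a topology in Java; the topic names ("orders", "confirmed-orders") and the filtering rule are assumptions for illustration:

   import java.util.Properties;
   import org.apache.kafka.common.serialization.Serdes;
   import org.apache.kafka.streams.KafkaStreams;
   import org.apache.kafka.streams.StreamsBuilder;
   import org.apache.kafka.streams.StreamsConfig;
   import org.apache.kafka.streams.kstream.KStream;

   public class TopologyExample {
       public static void main(String[] args) {
           StreamsBuilder builder = new StreamsBuilder();

           // Source -> processors -> sink: read "orders", keep confirmed ones, write out.
           KStream<String, String> orders = builder.stream("orders");
           orders.filter((key, value) -> value != null && value.contains("CONFIRMED"))
                 .mapValues(value -> value.toUpperCase())
                 .to("confirmed-orders");

           Properties props = new Properties();
           props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-demo");
           props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
           props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
           props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

           KafkaStreams streams = new KafkaStreams(builder.build(), props);
           streams.start();
           Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
       }
   }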


Processor


A processor in Kafka Streams is a node within the topology that represents a single processing step. Processors can be either stateless or stateful.

Processors use the Processor API to implement custom logic with methods like process(), init(), and close() for handling records, initialization, and cleanup. State is managed using state stores.
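
As a hedged sketch, a stateful processor that counts records per key might look like this; the store name "counts-store" is an assumption, not part of the original post:

   import org.apache.kafka.streams.processor.api.Processor;
   import org.apache.kafka.streams.processor.api.ProcessorContext;
   import org.apache.kafka.streams.processor.api.Record;
   import org.apache.kafka.streams.state.KeyValueStore;

   // Stateful processor: counts how many records arrive per key.
   public class CountingProcessor implements Processor<String, String, String, Long> {
       private ProcessorContext<String, Long> context;
       private KeyValueStore<String, Long> store;

       @Override
       public void init(ProcessorContext<String, Long> context) {
           this.context = context;
           this.store = context.getStateStore("counts-store"); // store name is an assumption
       }

       @Override
       public void process(Record<String, String> record) {
           Long current = store.get(record.key());
           long updated = (current == null) ? 1L : current + 1L;
           store.put(record.key(), updated);
           context.forward(new Record<>(record.key(), updated, record.timestamp()));
       }

       @Override
       public void close() {
           // Nothing to clean up; state stores are closed by the runtime.
       }
   }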

State stores


In Kafka Streams, state stores are local, queryable storage engines that maintain state information for stateful operations, ensuring durability, fault tolerance, and data consistency.
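
A state store is typically declared with a StoreBuilder and attached to the topology. A minimal sketch, again assuming the hypothetical "counts-store" used by the processor above:

   import org.apache.kafka.common.serialization.Serdes;
   import org.apache.kafka.streams.state.KeyValueStore;
   import org.apache.kafka.streams.state.StoreBuilder;
   import org.apache.kafka.streams.state.Stores;

   // Persistent (RocksDB-backed) key-value store, changelogged to Kafka for fault tolerance.
   StoreBuilder<KeyValueStore<String, Long>> countsStore =
       Stores.keyValueStoreBuilder(
           Stores.persistentKeyValueStore("counts-store"),
           Serdes.String(),
           Serdes.Long());

   // Wire it into the topology and hand it to the processor:
   // builder.addStateStore(countsStore);
   // stream.process(CountingProcessor::new, "counts-store");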

KStream

A KStream is an abstraction in Kafka Streams representing a continuous stream of records, where each record is a key-value pair. It is used for processing and transforming data in real-time.

  • KStream provides access to all records in a Kafka topic, treating each event independently.
  • Each event is processed by the entire topology and is immediately available to the KStream.
  • KStream, also known as a record stream or log, represents an infinite stream of records, similar to inserts in a database table.

Key Features of KStream

  • Record Stream: Represents an unbounded, continuous flow of records.
  • Transformations & Joins: Supports filtering, mapping, flat-mapping, and joining with KStreams, KTables, and GlobalKTables.
  • Aggregations & Fault Tolerance: Allows for aggregations, windowed operations, and ensures fault tolerance through Kafka's architecture.


How Kafka Streams Executes the Topology

Kafka Streams splits the topology into tasks, one per input topic partition. Tasks are assigned to stream threads, which may be spread across multiple application instances; this partition-based parallelism is what provides the library's scalability and fault tolerance.

Serialization / Deserialization


Serdes is the factory class in Kafka Streams that handles serialization and deserialization (serde = serializer/deserializer) of record keys and values.
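
For example, serdes can be supplied explicitly when reading a topic, overriding the defaults set in StreamsConfig; the topic name "metrics" is an assumption:

   import org.apache.kafka.common.serialization.Serdes;
   import org.apache.kafka.streams.StreamsBuilder;
   import org.apache.kafka.streams.kstream.Consumed;
   import org.apache.kafka.streams.kstream.KStream;

   StreamsBuilder builder = new StreamsBuilder();

   // Explicit serdes for a topic with string keys and long values.
   KStream<String, Long> metrics = builder.stream(
       "metrics",
       Consumed.with(Serdes.String(), Serdes.Long()));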


Error Handlers in Kafka Streams

Kafka Streams exposes pluggable handlers for the three main failure points: a DeserializationExceptionHandler for records that cannot be deserialized (e.g., LogAndContinueExceptionHandler or LogAndFailExceptionHandler), a ProductionExceptionHandler for failures while writing to Kafka, and a StreamsUncaughtExceptionHandler for unexpected errors in processing threads.
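
A minimal configuration sketch for the deserialization handler:

   import org.apache.kafka.streams.StreamsConfig;
   import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

   // Skip (and log) records that cannot be deserialized instead of crashing the app.
   props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
             LogAndContinueExceptionHandler.class);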

KTable

KTable is an abstraction in Kafka Streams that holds the latest value for a given key.

  • Update-Stream/Changelog: A KTable models a changelog, keeping only the latest value for each key as new records arrive.
  • Key-Based Updates: Records with the same key update the previous value; records with a null key are ignored.
  • Relational DB Analogy: Similar to an update operation on a table row for a given primary key (see the sketch below).
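
A short sketch of the difference in code; the topic name "user-profiles" is an assumption:

   import org.apache.kafka.streams.StreamsBuilder;
   import org.apache.kafka.streams.kstream.KStream;
   import org.apache.kafka.streams.kstream.KTable;

   StreamsBuilder builder = new StreamsBuilder();

   // Read the same hypothetical topic two ways (pick one per application;
   // a topology cannot register the same topic as a source twice):

   // As a KStream: every profile-change event is an independent record.
   KStream<String, String> profileEvents = builder.stream("user-profiles");

   // As a KTable: only the latest value per user key is retained.
   KTable<String, String> latestProfiles = builder.table("user-profiles");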




Friday, 10 January 2025

Request for Proposal (RFP) for Software Development


Purpose of RFPs:

    • Engage suitable vendors by eliminating ambiguities in requirements and deliverables.
    • Essential for companies lacking expertise in creating such documents.

 

What is an RFP for Software Development?

    • An initial document outlining the software project, inviting vendors to submit proposals.
    • Ensures project transparency and mitigates risks of contracting unsuitable providers.
    • Simplifies establishing business partnerships and aligning expectations.

 

Who Should Write an RFP?

    • Involve relevant members like product owners, project managers, business analysts, and engineers.
    • In small startups, CEOs or co-founders may create RFPs; in larger organizations, executive managers or procurement professionals handle this.

 

Benefits of RFPs for Software Development:

    • Select the right software development company.
    • Clarify project requirements.
    • Save time by minimizing repetitive questions.
    • Ensure transparent contracts.

 

Steps in the RFP Process:

    • Step 1: Executive Summary: Include project overview, company description, and project goals.
    • Step 2: Project Scope: Detail project management, infrastructure, functional, quality assurance, and platform requirements.
    • Step 3: Timeline: Set priorities and deadlines.
    • Step 4: Pricing Model and Budget: Outline costs and budget considerations.
    • Step 5: Vendor Bids and Selection Criteria: Establish evaluation criteria and streamline bid comparison.

 



Tips for Effective RFPs:

    • Keep RFPs clear and concise.
    • Prioritize value over low prices.
    • Present pain points rather than prescribed solutions.
    • Include a list of required features.
    • Preselect 3-5 companies to avoid information overload.

Request for Proposal (RFP) for Hotel Management System Modernization - A Case Study


Hotel MPD International, a well-established hotel with over 10 years in the industry, seeks to modernize its hotel management system. The current system is a monolithic application deployed on-premises using legacy technology. The goal is to transition to a modern, cloud-based system utilizing the latest technology to enhance efficiency, scalability, and guest experience.

Details here: https://github.com/manaspratimdas/hms/blob/main/RFP/01_rfp_modernization.md


Wednesday, 8 January 2025

Minimizing Toil & Wastage in Software Development

 

Toil

  • Definition: Toil refers to tasks that are manual, repetitive, automatable, tactical, and devoid of enduring value. These tasks scale linearly with the growth of the service and do not contribute to long-term improvements.
  • Examples: Routine maintenance, manual deployments, repetitive testing, and handling alerts manually.

Wastage

  • Definition: Wastage involves any activity that does not add value to the end product or service. This includes inefficient processes, unnecessary steps, and delays.
  • Examples: Waiting for approvals, redundant meetings, and excessive debugging due to poor code quality.





Smart CI-CD Pipeline

Continuous Integration (CI)
  • Linting: Analyzes source code to flag errors and stylistic issues, ensuring code quality and consistency.
  • Unit Test Coverage: Verifies individual components work as intended, detecting issues early.
  • Static Application Security Testing (SAST): Analyzes code for security vulnerabilities without execution, identifying security issues early.
  • Build: Compiles source code into executable artifacts, ensuring code is ready for testing and deployment.
  • Publish: Stores built artifacts in a repository, centralizing management and distribution of artifacts.
Continuous Delivery (CD)
  • Deploy to Non-Prod Environments: Deploys application to various environments iteratively, ensuring thorough testing in production-like environments.
  • Dynamic Application Security Testing (DAST): Identifies security vulnerabilities in web applications by simulating real-world attacks, providing a realistic assessment of security.
  • Regression Testing: Ensures new code changes do not adversely affect existing functionality, maintaining software stability.
  • Performance Testing: Evaluates the speed, responsiveness, and stability of a system under a given workload, enhancing user satisfaction.





Left Shift Strategy

  • Purpose: Integrates testing and quality assurance early in the development lifecycle.
  • Benefits: Early defect detection, improved collaboration, enhanced quality.
  • Examples: Continuous Integration (CI), Test-Driven Development (TDD), Static Code Analysis, Pair Programming, Automated Unit Testing, and optimizing the inner development loop.
Security left shift example







Automate the Code Review

  • Purpose: Maintain code quality and consistency using tools and AI.
  • Benefits:
    • Time Savings: Automates routine checks, freeing up developers' time.
    • Consistency: Ensures consistent enforcement of coding standards.
    • Improved Quality: Identifies potential issues and optimizations early.
  • Bitbucket / GitHub
    • Pull Request Templates: Standardize PRs with necessary information.
    • Static Code Analysis Tools: Use SonarQube or CodeClimate for quality and security checks.
    • Automate Reviewer Assignment: Assign reviewers automatically based on code changes.
    • Merge Checks: Enforce quality requirements before merging PRs.

Bulk Build & Deployment 


Bulk Build & Deployment is utilized in Release Management by some organizations to streamline the release process, where different teams build and deploy a large number of applications on a specific day of the sprint.

Challenges : Coordination Complexity, Build Quality, Resource Allocation, Visibility and Tracking, Error Handling, Environment Consistency, Automated Testing, Rollback Procedures.

Automate Creation of the Branch for Bulk Apps for Every Sprint
  • Reduces manual effort and ensures consistency.
  • Speeds up the initial setup process.

Automate PR Creation
  • Streamlines code integration and reduces administrative tasks.
  • Ensures standardized and error-free PRs.

Automate Merging of PR
  • Speeds up the integration process and reduces repetitive tasks.
  • Ensures consistent and error-free merges.

Scheduled Automate Build and Deployment in Non-Production
  • Ensures regular testing and deployment, catching issues early.
  • Reduces manual effort and provides continuous feedback.

Bulk Deletion of Unused Branches After the Release
  • Keeps the repository clean and organized.
  • Frees up resources and reduces manual cleanup tasks.

Monitoring and Alerting


Purpose: Track the performance and health of applications.

Benefits:
  • Proactive Issue Resolution: Detects and resolves issues before they impact users.
  • Improved Reliability: Ensures the system remains reliable and available.
  • Data-Driven Decisions: Provides insights for informed system improvements.

Scripting and scheduling

  • Cleanup Script: Automates removal of temporary files and old logs to free up disk space.
  • Space Check Script: Monitors disk space usage and alerts when thresholds are exceeded.
  • Installation Script: Automates software installation and configuration for consistency and efficiency.
  • Log Management and Rotation: Manages and rotates log files to prevent excessive disk space usage.
  • New Repository Creation Script with Predefined Rules: Automates creation of new repositories with predefined structures and rules.









Saturday, 4 January 2025

GenAI Fundamentals with LangChain

LangChain 

LangChain is a framework for developing applications powered by large language models (LLMs).

LLM: A machine learning model that can comprehend and generate human-language text by learning from massive language datasets.


Prompts & Prompt Chaining

Prompt: An input that a user provides to an AI model to get a specific response.

PromptTemplate

A PromptTemplate is a reusable prompt with named placeholders that are filled in at runtime.


Prompt Chaining: Sequential prompts enhance model coherence and structure; the output of one prompt becomes part of the next prompt's input.


Sequential Chain
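
The actual implementation uses LangChain in Python (see the linked repo below). As a language-neutral illustration, here is a minimal plain-Java sketch of a sequential chain, where callModel is a hypothetical stand-in for any LLM call, not a specific library's API:

   // Sequential chain: each step's output feeds the next step's prompt.
   public class SequentialChain {

       // Hypothetical stand-in for a real LLM call (e.g., an HTTP request
       // to a model provider); returns a canned string for illustration.
       static String callModel(String prompt) {
           return "<model response to: " + prompt + ">";
       }

       public static void main(String[] args) {
           // Step 1: generate an outline.
           String outline = callModel(
               "Create an outline for a blog post about Kafka Streams.");

           // Step 2: expand the outline into a draft, chaining the output.
           String draft = callModel(
               "Write an introduction based on this outline:\n" + outline);

           System.out.println(draft);
       }
   }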




Code snippet and test generation tool




Chatbot – Fundamentals


Chat Memory:
  • Feature in chatbot systems.
  • Remembers past interactions and context.
  • Enables personalized responses, as sketched below.
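
A minimal sketch of the idea in plain Java; the class and method names are illustrative, not a specific library's API:

   import java.util.ArrayList;
   import java.util.List;

   // Chat memory: remember past turns and replay them with each new request
   // so the model can answer with conversation context.
   public class ChatMemory {
       private final List<String> turns = new ArrayList<>();

       public void record(String role, String message) {
           turns.add(role + ": " + message);
       }

       // Build the full prompt: prior turns first, then the new user message.
       public String withContext(String newUserMessage) {
           return String.join("\n", turns) + "\nuser: " + newUserMessage;
       }
   }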


AI-powered chat functionalities





Retrieval-Augmented Generation


RAG is a technique for augmenting LLM knowledge with additional data.

RAG Architecture:
  • Indexing: Data ingestion and indexing pipeline from a source, typically done offline.
  • Retrieval and generation: The RAG chain retrieves relevant data from the index based on the user's query and passes it to the model at runtime, as sketched below.
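
The heart of the retrieval step is ranking indexed chunks by similarity between embeddings. A hedged plain-Java sketch with toy vectors (real embeddings would come from an embedding model and have hundreds of dimensions):

   import java.util.Comparator;
   import java.util.Map;

   // Retrieval step of RAG: rank indexed chunks by cosine similarity
   // between the query embedding and each chunk embedding.
   public class Retriever {

       static double cosine(double[] a, double[] b) {
           double dot = 0, normA = 0, normB = 0;
           for (int i = 0; i < a.length; i++) {
               dot += a[i] * b[i];
               normA += a[i] * a[i];
               normB += b[i] * b[i];
           }
           return dot / (Math.sqrt(normA) * Math.sqrt(normB));
       }

       public static void main(String[] args) {
           // Toy index: chunk text -> embedding vector.
           Map<String, double[]> index = Map.of(
               "Check-in starts at 2 PM.", new double[]{0.9, 0.1, 0.0},
               "The pool is open until 10 PM.", new double[]{0.1, 0.8, 0.2});

           double[] queryEmbedding = {0.85, 0.15, 0.05}; // e.g., "When can I check in?"

           // Pick the most similar chunk to pass to the model as context.
           String bestChunk = index.entrySet().stream()
               .max(Comparator.comparingDouble(e -> cosine(queryEmbedding, e.getValue())))
               .map(Map.Entry::getKey)
               .orElseThrow();

           System.out.println("Context for the LLM: " + bestChunk);
       }
   }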



Embedding Generation




Contextual Question Handling and Retrieval-Based QA System



https://github.com/manaspratimdas/GenAIwithPy/tree/main/04-question-answer-module/042-pdf



LangChain Agents


  • Agents use LLMs as reasoning engines for decision-making.
  • They execute actions based on the LLM outputs.
  • Results from actions can influence further decision-making by the LLM, as in the loop sketched below.
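
A hedged sketch of that loop in plain Java; the model call and the tool are stubs, not a specific agent framework:

   // Minimal agent loop: the LLM picks an action, we execute it, and the
   // observation is fed back into the next reasoning step.
   public class AgentLoop {

       static String callModel(String context) {               // stub LLM call
           return context.contains("Observation") ? "FINISH: 42" : "ACTION: calculator 6*7";
       }

       static String runTool(String action) {                  // stub tool
           return action.contains("6*7") ? "42" : "unknown";
       }

       public static void main(String[] args) {
           String context = "Question: What is 6 times 7?";
           for (int step = 0; step < 5; step++) {               // cap the loop
               String decision = callModel(context);
               if (decision.startsWith("FINISH")) {
                   System.out.println("Answer: " + decision.substring(8));
                   return;
               }
               String observation = runTool(decision);
               context += "\n" + decision + "\nObservation: " + observation;
           }
       }
   }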




Setup:

  • OpenAI Account Setup
  • Install Python 3.11.0 
  • pip3 install pipenv 
  • pipenv install
  • pipenv shell












Monday, 30 December 2024

Database Profiling

Database profiling is the process of examining the data in a database to understand its structure, content, quality, and relationships, typically by collecting statistics at the column, cross-column, and cross-table level.

SQL Column Profiling


1. Count of NULL values:   

   SELECT COUNT(*) - COUNT(column_name) AS null_count
   FROM table_name;  

2. Distinct values and their counts:   

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name;  

3. Minimum and maximum values:   

   SELECT MIN(column_name) AS min_value, MAX(column_name) AS max_value
   FROM table_name;  

4. Average and standard deviation:   

   SELECT AVG(column_name) AS avg_value, STDDEV(column_name) AS stddev_value
   FROM table_name;  

5. String length distribution:   

   SELECT LENGTH(column_name) AS length, COUNT(*)
   FROM table_name
   GROUP BY LENGTH(column_name);


SQL Cross-Column Profiling


1. Correlation between two numeric columns:   

   SELECT CORR(column1, column2) AS correlation
   FROM table_name;  

2. Finding unique combinations of values across columns:   

   SELECT column1, column2, COUNT(*)
   FROM table_name
   GROUP BY column1, column2; 

3. Detecting functional dependencies (values of column1 that determine exactly one column2 value):

   SELECT column1, COUNT(DISTINCT column2) AS unique_values
   FROM table_name
   GROUP BY column1
   HAVING COUNT(DISTINCT column2) = 1;

4. Checking for null values across multiple columns:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column1 IS NULL OR column2 IS NULL; 


SQL Cross-Table Profiling

1. Foreign Key Relationships:   

   SELECT 
       tc.table_schema, 
       tc.table_name, 
       kcu.column_name, 
       ccu.table_schema AS foreign_table_schema,
       ccu.table_name AS foreign_table_name,
       ccu.column_name AS foreign_column_name 
   FROM 
       information_schema.table_constraints AS tc 
       JOIN information_schema.key_column_usage AS kcu
         ON tc.constraint_name = kcu.constraint_name
         AND tc.table_schema = kcu.table_schema
       JOIN information_schema.constraint_column_usage AS ccu
         ON ccu.constraint_name = tc.constraint_name
         AND ccu.table_schema = tc.table_schema
   WHERE tc.constraint_type = 'FOREIGN KEY';   

2. Join Analysis:   

   SELECT 
       t1.column1, t2.column2, COUNT(*)
   FROM 
       table1 t1
       JOIN table2 t2 ON t1.common_column = t2.common_column
   GROUP BY 
       t1.column1, t2.column2;   

3. Referential Integrity Checks:   

   SELECT 
       t1.common_column
   FROM 
       table1 t1
   LEFT JOIN 
       table2 t2 ON t1.common_column = t2.common_column
   WHERE 
       t2.common_column IS NULL;


SQL Data Rule Validation Profiling


1. Check for non-null values:   

   SELECT COUNT(*) AS null_count
   FROM table_name
   WHERE column_name IS NULL;  

2. Check for unique values:  

   SELECT column_name, COUNT(*)
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1;  

3. Check for valid ranges:   

   SELECT *
   FROM table_name
   WHERE column_name < min_value OR column_name > max_value;   

4. Check for specific patterns (e.g., email format):   

   SELECT *
   FROM table_name
   WHERE column_name NOT LIKE '%_@__%.__%';   

5. Check for foreign key constraints:   

   SELECT t1.*
   FROM table1 t1
   LEFT JOIN table2 t2 ON t1.foreign_key = t2.primary_key
   WHERE t2.primary_key IS NULL;

SQL Cardinality

1. Count Distinct Values:  

   SELECT COUNT(DISTINCT column_name) AS cardinality
   FROM table_name;   

2. High Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   ORDER BY frequency DESC;  

3. Low Cardinality Example:   

   SELECT column_name, COUNT(*) AS frequency
   FROM table_name
   GROUP BY column_name
   HAVING COUNT(*) > 1
   ORDER BY frequency DESC;


Friday, 27 December 2024

Data Governance

 

Data governance is a set of processes, policies, and standards that ensure data is secure, accurate, and usable.


Data Governance Framework

The DAMA-DMBOK (Data Management Body of Knowledge) is a comprehensive framework developed by DAMA International to guide organizations in managing data as a strategic asset. It provides best practices and structured approaches across various aspects of data management, ensuring data quality, accessibility, and compliance.



Case Study- Hospitality Domain






Data Inventory and Classification




Data Quality Management





Data Security and Privacy





Data Integration and Interoperability






Data Access and Usage








Tuesday, 24 December 2024

Domain-Driven Design for Hotel Management System

Core Domain

The core domain represents the most critical and unique aspects of the hotel management system that provide competitive advantage.


Booking Management

  • Bounded Context: Handles room availability, reservations, and cancellations.
  • Ubiquitous Language: Booking, Reservation, Availability, Cancellation, Check-in, Check-out (see the sketch below).
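
As an illustrative sketch, the ubiquitous language of this context can surface directly in the domain model; all names and rules below are assumptions for illustration, not the system's actual design:

   import java.time.LocalDate;

   // Aggregate root for the Booking Management bounded context; field and
   // method names deliberately mirror the ubiquitous language above.
   public class Reservation {
       public enum Status { CONFIRMED, CHECKED_IN, CHECKED_OUT, CANCELLED }

       private final String reservationId;
       private final String roomId;
       private final LocalDate checkIn;
       private final LocalDate checkOut;
       private Status status;

       public Reservation(String reservationId, String roomId,
                          LocalDate checkIn, LocalDate checkOut) {
           // Domain invariant: a stay must span at least one night.
           if (!checkOut.isAfter(checkIn)) {
               throw new IllegalArgumentException("Check-out must be after check-in");
           }
           this.reservationId = reservationId;
           this.roomId = roomId;
           this.checkIn = checkIn;
           this.checkOut = checkOut;
           this.status = Status.CONFIRMED;
       }

       // Cancellation is a domain operation with rules, not a row delete.
       public void cancel() {
           if (status != Status.CONFIRMED) {
               throw new IllegalStateException("Only a confirmed reservation can be cancelled");
           }
           status = Status.CANCELLED;
       }

       public Status status() { return status; }
   }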


Guest Management

  • Bounded Context: Manages guest profiles, preferences, and loyalty programs.
  • Ubiquitous Language: Guest Profile, Loyalty Points, Preferences, Membership, Rewards.


Payment Processing

  • Bounded Context: Manages payment transactions, billing, and refunds.
  • Ubiquitous Language: Payment, Billing, Invoice, Refund, Transaction, Payment Gateway.


Supporting Domain

The supporting domain includes functionalities that are important but not unique to the hotel management system.

Customer Support

  • Bounded Context: Handles guest inquiries, complaints, and support tickets.
  • Ubiquitous Language: Support Ticket, Inquiry, Complaint, Resolution, Live Chat, Help Desk.


Housekeeping Management

  • Bounded Context: Manages housekeeping schedules, tasks, and inventory.
  • Ubiquitous Language: Housekeeping Schedule, Task, Inventory, Cleaning, Maintenance.


Event Management

  • Bounded Context: Manages event bookings, scheduling, and coordination.
  • Ubiquitous Language: Event Booking, Schedule, Coordination, Venue, Catering.



Generic Domain

The generic domain includes functionalities that are common across many systems and can be outsourced or reused.


Authentication and Authorization

  • Bounded Context: Manages user authentication, roles, and permissions.
  • Ubiquitous Language: User, Role, Permission, Authentication, Authorization, Login, Access Control.


Reporting and Analytics

  • Bounded Context: Generates reports and provides analytics on system usage and performance.
  • Ubiquitous Language: Report, Analytics, Dashboard, Metrics, KPI, Data Visualization.


Notification Service

  • Bounded Context: Manages sending notifications via email, SMS, and push notifications.
  • Ubiquitous Language: Notification, Email, SMS, Push Notification, Alert, Message.

Documentation for Domain-Driven Design for Hotel Management System
