datablogs

Wednesday, February 25, 2026

MongoDB Crashes Due to 'Too Many Open Files' Error

Problem 

When running MongoDB in production, configuring the open file descriptor limit (nofile) is critical. Ubuntu's default limit (1024) is insufficient for production workloads and may result in connection failures or 'Too many open files' errors under high load.

The same problem occurs wherever we run production workloads, whether on AWS EC2, Azure VMs, GCP, or DigitalOcean.


Why Open File Limits Matter

MongoDB uses file descriptors for:

  • Client connections
  • WiredTiger data files
  • Journal files
  • Log files
  • Internal sockets
Each active connection and file consumes a descriptor. High-traffic environments require a significantly higher limit.
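As a rough way to see this in practice, you can count the descriptors a process currently holds by listing its /proc fd directory (shown here against the current shell via /proc/self; substitute the mongod PID, e.g. /proc/"$(pidof mongod)"/fd, on a real server):

```shell
# Count open file descriptors for a process.
# "self" means the current process; replace with a real PID as needed.
FD_COUNT=$(ls /proc/self/fd | wc -l)
echo "open file descriptors: $FD_COUNT"
```

Comparing this count against the configured limit tells you how close the process is to exhaustion.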

Recommended Production Value

Set the open file limit to 1048576.

This is typically the maximum allowed by the Ubuntu kernel and is effectively unlimited for most real-world workloads.

Configuration Steps (systemd)

Create or edit the systemd override file:

sudo systemctl edit mongod

or edit the file directly:

vi /etc/systemd/system/mongod.service.d/override.conf

Add the following configuration:

[Service]
LimitNOFILE=1048576

Reload and restart MongoDB:

   sudo systemctl daemon-reexec
   sudo systemctl daemon-reload
   sudo systemctl restart mongod
Also update the per-user limits configuration (note: limits.conf applies to PAM/login sessions, while the systemd override above is what governs the mongod service itself):

vi /etc/security/limits.conf
mongodb soft nofile unlimited
mongodb hard nofile unlimited
root soft nofile unlimited
root hard nofile unlimited
Verification 

cat /proc/$(pidof mongod)/limits | grep "open files"

Expected output (both the soft and hard limits should show the configured value):

Max open files            1048576              1048576              files

Kernel Limitation 

Linux does not support true unlimited file descriptors. Even when setting LimitNOFILE=infinity, the value is capped by the kernel parameter fs.nr_open. On Ubuntu, this is commonly 1048576.
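You can inspect these kernel ceilings directly from /proc (the persistence step is an assumption that your distribution honors /etc/sysctl.d drop-ins, which Ubuntu does):

```shell
# Per-process descriptor ceiling; any LimitNOFILE above this is silently capped
cat /proc/sys/fs/nr_open
# System-wide ceiling across all processes
cat /proc/sys/fs/file-max
# To raise fs.nr_open persistently (run as root):
#   echo 'fs.nr_open = 1048576' | sudo tee /etc/sysctl.d/90-mongodb.conf
#   sudo sysctl --system
```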

Final Recommendation 

For MongoDB production environments:
  • Set LimitNOFILE to 640000 or 1048576
  • Ensure fs.nr_open and fs.file-max are properly configured
  • Always verify after restart
Proper OS tuning ensures stability, prevents resource exhaustion, and maintains optimal MongoDB performance under high concurrency.

Tuesday, February 24, 2026

What Are Vector Databases? Powering the Next Wave of AI-Driven Experiences with AWS

In today’s AI-first world, the ability to understand and search unstructured data - like text, images, and audio - is transforming how applications deliver value. Traditional databases excel at exact matches and structured queries, but to enable semantic search, recommendations, and intelligent retrieval, we need something more powerful: vector databases.

What Is a Vector Database?

A vector database is a specialized system designed to store and query high-dimensional vectors - numerical representations (embeddings) of data generated by machine learning models. Instead of searching by exact keywords, these databases find the “nearest neighbors” in multi-dimensional space, enabling applications to retrieve items that are semantically similar to a query.
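As a toy sketch of what "nearest neighbors in multi-dimensional space" means, here is a brute-force similarity search in pure Python (the 3-dimensional vectors are made up for illustration; real embedding models produce hundreds or thousands of dimensions, and real vector databases use approximate indexes such as HNSW rather than brute force):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product normalized by the vectors' magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, corpus):
    # Index of the corpus vector most similar to the query (brute force)
    return max(range(len(corpus)), key=lambda i: cosine_similarity(query, corpus[i]))

# Toy "embeddings" for three documents
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(nearest([0.9, 0.1, 0.0], docs))  # → 0 (closest to the first document)
```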

Why Vector Databases Matter

As organizations handle massive volumes of unstructured and semi-structured data, vector databases provide:

  • Efficient similarity search, enabling semantic understanding rather than simple keyword matching.
  • Operationalization of embeddings, letting developers index and query vectors as part of real-world applications.
  • Enterprise-grade capabilities such as security, scalability, and high availability.
By accelerating vector-based search and retrieval, these databases unlock richer AI experiences - from conversational agents to personalized recommendations.

Vector Databases on AWS

AWS provides multiple services to support vector-driven applications:

  • Amazon OpenSearch Service – Enables scalable, high-performance semantic and hybrid search.
  • Amazon Aurora PostgreSQL & Amazon RDS with pgvector – Store embeddings and perform similarity search within relational databases.
  • Amazon MemoryDB and Amazon DocumentDB – Offer vector search capabilities for high-throughput and document-centric workloads.
  • Amazon S3 Vectors – Native vector storage designed for large-scale AI workloads.
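For example, the pgvector path above can be sketched roughly as follows (illustrative table and values; assumes the pgvector extension is available on the Aurora/RDS PostgreSQL instance):

```sql
-- Enable pgvector and store 3-dimensional embeddings (real models use far more)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,0,0]'), ('[0,1,0]'), ('[0,0,1]');
-- Nearest neighbor by cosine distance (the <=> operator)
SELECT id FROM items ORDER BY embedding <=> '[0.9,0.1,0]' LIMIT 1;
```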
Real-World Use Cases

With vector databases, organizations can build:

  • Semantic search engines that understand user intent.
  • AI-powered chatbots grounded in enterprise knowledge (RAG architectures).
  • Recommendation systems based on contextual similarity.
  • Multimodal search combining text, images, and audio.
Conclusion

Vector databases are becoming an essential component of modern AI infrastructure. By transforming embeddings into actionable intelligence, they help businesses deliver smarter, faster, and more personalized user experiences. With AWS offering a wide range of vector-capable solutions, organizations can confidently adopt semantic search and advanced AI capabilities at scale.

Next, we will see how to implement these in AWS services.

Wednesday, February 4, 2026

Is AWS Glue Always as Powerful as Data Engineers Trust?

AWS Glue is a powerful data engineering platform when designed, tuned, and governed correctly. But treating it as a simple ETL utility often leads to cost, performance, and reliability issues.

As data engineers, we need a solid understanding of AWS Glue.

Myth 1: AWS Glue is only for simple ETL

Reality:

AWS Glue supports complex transformations including joins, aggregations, schema evolution, incremental processing, and large-scale distributed processing using Apache Spark. It is suitable for enterprise-grade data engineering workloads.

Myth 2: AWS Glue is serverless, so performance tuning is not required

Reality:

While infrastructure management is serverless, Glue jobs still require tuning:

  • Worker types (G.1X, G.2X, G.4X)
  • Number of DPUs
  • Spark configurations
  • Partitioning and data layout

Poor tuning leads to high cost and slow execution.
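As an illustration of explicit sizing, a Glue job definition might be declared like this (all names, paths, and values are hypothetical placeholders) and created with aws glue create-job --cli-input-json file://job.json:

```json
{
  "Name": "orders-etl",
  "Role": "GlueServiceRole",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://my-bucket/scripts/orders.py"
  },
  "GlueVersion": "4.0",
  "WorkerType": "G.2X",
  "NumberOfWorkers": 10
}
```

Pinning WorkerType and NumberOfWorkers deliberately, rather than accepting defaults, is the first step in controlling both cost and runtime.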

Myth 3: AWS Glue works only with Amazon S3

Reality:

AWS Glue integrates with multiple data sources:

  • Amazon RDS and Aurora
  • Amazon Redshift
  • DynamoDB
  • JDBC sources (Oracle, SQL Server, MySQL, PostgreSQL)
  • Streaming sources such as Kafka and Kinesis

Myth 4: AWS Glue is very expensive

Reality:

Glue becomes expensive mainly due to design issues:

  • Over-provisioned DPUs
  • Full data reloads instead of incremental loads
  • Missing job bookmarks

With optimized design, Glue is often more cost-effective than always-on Spark clusters.

Myth 5: Glue Crawlers automatically handle schema management

Reality:

Crawlers may:

  • Create excessive tables
  • Misinterpret schema changes
  • Perform poorly with nested or semi-structured data

Production systems typically require controlled schema management and governance.

Myth 6: AWS Glue replaces data warehouses

Reality:

AWS Glue is a data integration and transformation service. It complements data warehouses by preparing and transforming data before loading into analytics platforms.

Myth 7: Glue jobs are difficult to debug

Reality:

Glue supports debugging through:

  • Amazon CloudWatch logs
  • Spark UI
  • Job bookmarks
  • Glue Studio monitoring

Most challenges arise from limited Spark expertise rather than Glue itself.

Myth 8: AWS Glue supports only batch processing

Reality:

AWS Glue also supports:

  • Streaming ETL
  • Near real-time pipelines
  • Event-driven processing

It is not limited to scheduled batch workloads.

Myth 9: AWS Glue is a set-and-forget service

Reality:

Production Glue pipelines require:

  • Cost and performance monitoring
  • Schema change handling
  • Failure alerts and retries
  • Version control and CI/CD

Glue jobs should be treated as production-grade software.

Myth 10: AWS Glue is only for data engineers

Reality:

With Glue Studio, SQL-based transformations, and visual workflows, Glue can be effectively used by analytics teams, architects, and platform teams.

If you are having issues, please connect with us for help!

Tuesday, December 23, 2025

Steps to Change TempDB Location in SQL Server

It is very easy and simple; here are the steps without too much theory.

Why move it? When TempDB sits on the C drive or the SQL data drive, it can cause I/O contention, so we move the TempDB files to a different location.

Step 1: Check Current TempDB Files Location

Run the following query:
USE tempdb;
EXEC sp_helpfile;
Step 2: Plan New Location

- Create a new folder on the desired drive (e.g., D:\SQLData\TempDB).
- Ensure SQL Server service account has Full Control on that folder.

Step 3: Modify TempDB File Paths

Run these commands (adjust paths as needed):
USE master;
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'D:\SQLData\TempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'D:\SQLData\TempDB\templog.ldf');
Repeat for additional files if they exist.

Step 4: Restart SQL Server Service

Stop and Start the SQL Server instance. Restart is mandatory for changes to take effect.

Step 5: Verify New Location
USE tempdb;
EXEC sp_helpfile;
This should show the new paths.

Step 6: Clean Up Old Files

After SQL Server restarts successfully and TempDB is created in the new location, manually delete
the old tempdb files from the old drive.

SQL Server High CPU Issue Caused by Implicit Conversion (nvarchar vs varchar)

Problem Overview

Recently, we encountered a high CPU utilization issue on a SQL Server instance — even though only one small SELECT query was executing. At first glance, it seemed unusual that such a simple query could consume so much CPU. However, upon investigating the execution plan, we noticed that the optimizer was performing a conversion from varchar to nvarchar in the predicate.

Root Cause: Implicit Data Type Conversion

In SQL Server, when a query compares two different data types (for example, a column defined as varchar and a variable or literal passed as nvarchar), SQL Server performs an implicit conversion to make them compatible.

Unfortunately, this can:
  • Prevent index seeks (forcing a full scan instead)
  • Increase CPU usage
  • Cause plan cache bloat (multiple plans for same query)
  • Lead to unpredictable performance degradation
Example:

-- Table definition
CREATE TABLE Customer (
CustomerID INT,
CustomerCode VARCHAR(20)
);
-- Query from application
SELECT * FROM Customer WHERE CustomerCode = N'CUST001'; -- note the N prefix (nvarchar)

Even though an index exists on CustomerCode, SQL Server converts the column during execution:

CONVERT_IMPLICIT(nvarchar(4000), [CustomerCode], 0)

As a result, the optimizer cannot perform an index seek, leading to a full table scan and high CPU utilization.

Investigation Steps
  1. Checked sys.dm_exec_requests and sys.dm_exec_query_stats – saw high CPU for a single query.
  2. Reviewed the Actual Execution Plan – found CONVERT_IMPLICIT on the predicate.
  3. Confirmed column type: varchar(20)
  4. Confirmed query parameter or literal type: nvarchar
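The DMV check in step 1 follows a generic pattern along these lines (illustrative, not the exact query we ran):

```sql
-- Top CPU consumers from the plan cache
SELECT TOP 5
    qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_cpu_microseconds DESC;
```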
Resolution
  • Since the customer application could not modify the query, we proposed aligning the data types at the table level:

ALTER TABLE Customer ALTER COLUMN CustomerCode NVARCHAR(20);

After modifying the column data type to match the query, the optimizer was able to use the index correctly.
Result:

  • CPU utilization dropped significantly
  • Query execution time improved
  • No functional impact

Best Practices to Avoid Implicit Conversions
  • Keep data types consistent between table columns and application parameters.
  • Avoid using the Unicode prefix (N'...') unless necessary.
  • Use parameter sniffing tests to identify mismatched parameter types.
  • In critical systems, standardize column data types across schemas and applications.
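When the application can be changed (the opposite direction from the fix above), the cleanest approach is to pass parameters with the column's own type, sketched here against the Customer table from the example:

```sql
-- Matching the parameter type to the VARCHAR(20) column avoids
-- CONVERT_IMPLICIT and allows an index seek on CustomerCode
DECLARE @code VARCHAR(20) = 'CUST001';
SELECT CustomerID, CustomerCode
FROM Customer
WHERE CustomerCode = @code;
```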
Summary

Even a small implicit conversion can lead to large performance issues in SQL Server. By ensuring data type alignment between the database schema and query parameters, you can prevent unnecessary scans and CPU spikes.