A data lake is a centralised storage repository, typically on cloud object storage, that holds raw data in its native format for later analytics, machine learning and compliance reporting. It supports schema-on-read, allowing multiple tools to consume the same data without re-ingestion.

How does a data lake differ from a data warehouse?

A warehouse stores curated, modelled data optimised for SQL analytics. A lake stores raw data flexibly and supports analytics, ML and streaming on the same files. Many modern Indian enterprises run both, with the warehouse as a curated layer fed from the lake.

What use cases suit a data lake in India?

GST analytics, vendor risk scoring, MCA V3 due diligence, customer 360, IoT telemetry, fraud detection and unstructured contract analytics all benefit from a lake's flexibility, lower cost and ability to hold structured and unstructured data together with consistent governance and retention controls.

How is a data lake governed under the DPDP Act?

Maintain a dataset catalog with owner, sensitivity classification and retention, apply column-level masking to personal data, log every access, document the lawful basis under the DPDP Act, 2023 and its 2025 rules, and align retention to the strictest applicable Indian statute.

Data Lake in Modern Business


A data lake is a centralised storage repository, typically on object storage such as Azure ADLS Gen2, AWS S3 or GCP Cloud Storage, that holds data in its native format: structured, semi-structured and unstructured, all in one place. For Indian businesses in FY 2026-27, it has become the default architecture for unifying GST returns, ERP exports, MCA V3 filings and operational data under one governed platform. It does so at a storage cost that can be 70–85% lower than a traditional warehouse, and with the flexibility that both modern ML workloads and the Digital Personal Data Protection Act, 2023 (DPDP Act) now demand.

What a Data Lake Is — and What It Is Not

The fastest way to understand a data lake is to contrast it with two things people confuse it with: the data warehouse and the data swamp.

A traditional data warehouse — Oracle, SQL Server Analysis Services, Redshift in its classic mode — demands a schema before data arrives. You define tables, data types and relationships upfront: schema on write. This works well when your data is structured, predictable and modest in volume. It breaks down the moment you need to store a GSTR-2B JSON payload alongside a scanned vendor contract PDF alongside a factory IoT sensor stream. Every new source type becomes a migration project.

A data lake inverts this. You store data in its raw, native format — JSON, CSV, Parquet, PDF, audio, video — on cheap object storage. Schema is applied at read time by the query engine, not at write time by the storage layer. A SQL analyst queries it through Azure Synapse or Amazon Athena. A data scientist reads the same files into a Python notebook. A compliance tool scans for PII. One copy of data, many consumption patterns — no duplication, no ETL every time a new use case emerges.
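To make schema-on-read concrete, here is a minimal sketch in plain Python. The two raw records, their field names and the read_with_schema helper are hypothetical illustrations, not the format of any real GSTN payload; in practice the query engine (Synapse, Athena, Spark) plays this role.

```python
import json

# Hypothetical raw records, stored once exactly as they arrived.
RAW_LINES = [
    '{"gstin": "27ABCDE1234F1Z5", "inv_no": "INV-001", "itc": "1800.50", "contact": "a@example.com"}',
    '{"gstin": "29PQRSX5678K1Z2", "inv_no": "INV-002", "itc": "950.00", "note": "new field"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> caster) at read time; absent fields become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        })
    return rows


# Finance view: typed amounts for SQL-style aggregation.
finance_rows = read_with_schema(RAW_LINES, {"gstin": str, "inv_no": str, "itc": float})

# Compliance view: the same stored lines, projected to look for personal data.
pii_rows = read_with_schema(RAW_LINES, {"inv_no": str, "contact": str})

total_itc = sum(row["itc"] for row in finance_rows)
print(total_itc)
print(pii_rows)
```

Note that the second record carries a field the first does not, and neither reader breaks: each consumer simply asks for the columns it cares about.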

A data swamp is what a data lake becomes when nobody governs it. Files pile up with no owner, no lineage, no retention policy. Analysts cannot trust what they find. Auditors cannot say who accessed what. The Data Protection Board of India asks which personal data was processed and when, and nobody can answer. This is the failure mode that governance prevents, and it is entirely avoidable with the right operating model from the first sprint.

The technical anchor concept is schema-on-read. It gives the lake its flexibility, but it also means you cannot rely on the storage layer to enforce data quality. That job belongs to your ingestion pipeline and your data catalog — which is precisely why governance is inseparable from architecture, not an afterthought you bolt on in year two.

Why Indian Businesses Are Adopting Data Lakes in FY 2026-27

Several forces have converged to make the data lake the natural default for Indian organisations this year.

Volume and variety of compliance data. A mid-size Indian company files GSTR-1 and GSTR-3B monthly, and GSTR-9 and GSTR-9C annually. It submits e-invoices through the Invoice Registration Portal (IRP) for its B2B transactions once its turnover crosses the applicable threshold. It maintains e-way bills, TDS challans and quarterly TDS statements (Form 26Q/27Q), advance tax workings and MCA V3 annual returns. Layer on ERP exports, bank statements, payroll data and vendor contracts and you have dozens of structured, semi-structured and unstructured data streams that no single relational warehouse handles cleanly. The data lake absorbs all of them without a schema migration each time the GSTN or MCA changes a field.

Statutory retention obligations create long-lived volume. GST law requires records to be retained for 72 months from the due date of the annual return (Section 36, CGST Act, 2017, read with Rule 56 of the CGST Rules, 2017). The Income Tax Act, 1961 mandates books of account for 6 years from the end of the relevant assessment year. The Companies Act, 2013 requires maintenance of books for 8 years (Section 128). A data lake that tiers hot data to cooler, cheaper storage as it ages handles multi-year retention at a fraction of the cost of keeping rows alive in a managed relational database.

The cost arithmetic has shifted decisively. Object storage on Azure ADLS Gen2 or AWS S3 costs roughly Rs. 1.5–2 per GB per month on hot tiers and as little as Rs. 0.20–0.40 per GB on cool or archive tiers. Equivalent capacity in a managed relational database runs 5–10× more. At 50 TB of data — easily accumulated across 6 years of GST data, ERP exports and contracts for a Rs. 200 crore turnover company — the annual storage saving alone can exceed Rs. 40 lakh before any compute cost difference is considered.
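The arithmetic behind that saving is easy to check. The sketch below uses only the hot-tier rates and the 5–10× multiple quoted above; decimal terabytes (1 TB = 1,000 GB) are an assumption made for round numbers, and real bills also include transaction and egress charges that are ignored here.

```python
GB_PER_TB = 1000   # decimal terabytes, assumed for round numbers
LAKH = 100_000


def annual_cost_rs(tb, rate_per_gb_month):
    """Yearly storage cost in rupees for a given per-GB monthly rate."""
    return tb * GB_PER_TB * rate_per_gb_month * 12


for rate in (1.5, 2.0):               # hot-tier object storage, Rs. per GB per month
    lake = annual_cost_rs(50, rate)
    for multiple in (5, 10):          # managed relational database cost multiple
        saving = lake * multiple - lake
        print(f"Rs. {rate}/GB, database at {multiple}x: saving Rs. {saving / LAKH:.0f} lakh a year")
```

Only the cheapest combination (Rs. 1.5 per GB against a 5× database) falls short of Rs. 40 lakh, at Rs. 36 lakh a year; the other three combinations exceed it.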

ML workloads need unstructured data. A contract analytics model needs the original PDF, not a relational summary. A call-centre sentiment model needs audio files or raw transcripts. A vendor risk model needs news feeds and regulatory filings. These cannot live in a row-and-column warehouse; they live naturally in a lake alongside the structured ERP data that provides business context.

India's digital public infrastructure emits evolving JSON. GSTN, MCA V3, Account Aggregator (AA), ONDC and OCEN are all API-driven, schema-flexible systems. A lake absorbs their schema changes without ALTER TABLE scripts. A warehouse needs a migration every time the government adds a field to the e-invoice schema — which happens more often than most finance teams would like.

Reference Architecture: The Three-Zone Model

A well-designed data lake for an Indian business uses three logical zones within the same object storage account or bucket hierarchy. Think of them as: immutable raw → trusted clean → business ready.

Zone 1 — Landing (Raw)

Every source system writes its data here without transformation. Files are immutable once written. Every object carries metadata: source system name, ingest timestamp, pipeline run ID and a data-sensitivity classification tag. A GSTR-2B feed lands as the JSON returned by the GSTN API. An SAP export lands as the exact CSV the system produced. Nothing is changed, nothing discarded. If a downstream transformation turns out to be wrong, you replay from the landing zone. Retention aligns to the strictest applicable statute: 8 years for Companies Act data, 72 months for GST, 6 years for income tax. In practice, keep all landing data for 8 years and tier to archive storage after 3 years.
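A minimal sketch of a landing write, using the local filesystem as a stand-in for object storage. The land_object function, the folder layout and the sidecar file name are all hypothetical; on a real cloud account you would rely on the provider's own object metadata and immutability features rather than this file-mode trick.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_object(root, source_system, filename, payload, sensitivity, run_id, ingested_at=None):
    """Write payload bytes unchanged plus a metadata sidecar; never overwrite."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    target_dir = Path(root) / "landing" / source_system / ingested_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Mode "xb" raises FileExistsError if the object exists, which keeps landing immutable.
    with open(target, "xb") as handle:
        handle.write(payload)
    metadata = {
        "source_system": source_system,
        "ingested_at": ingested_at.isoformat(),
        "pipeline_run_id": run_id,
        "sensitivity": sensitivity,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    Path(str(target) + ".meta.json").write_text(json.dumps(metadata, indent=2))
    return target, metadata


with tempfile.TemporaryDirectory() as root:
    when = datetime(2026, 4, 14, tzinfo=timezone.utc)
    args = (root, "gstn", "gstr2b_2026_03.json", b'{"docs": []}', "finance", "run-001", when)
    target, meta = land_object(*args)
    stored = target.read_bytes()
    try:
        land_object(*args)          # a second write to the same object must fail
        overwrote = True
    except FileExistsError:
        overwrote = False

print(meta["sensitivity"], overwrote)
```

The checksum in the sidecar is an extra not mentioned above; it is included because it makes a later replay verifiable.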

Zone 2 — Curated (Trusted)

Ingestion pipelines read from landing, apply validation rules, cleanse nulls and duplicates, standardise formats — dates to ISO 8601, currency to INR — and write output as Parquet or Delta Lake format: columnar, compressed, partitioned by date and entity. A GSTR-2B JSON file becomes a partitioned Parquet table with validated GSTIN, invoice number, ITC amount and supplier name columns, queryable by any SQL engine in seconds. Transformation code is version-controlled. Every curated dataset carries lineage: which landing file it was derived from, which pipeline version ran, and when.
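The validation and standardisation step might look like the sketch below. It is plain Python rather than Spark, the input field names and the dd-mm-yyyy source date format are assumptions, and the regular expression checks only the commonly documented shape of a GSTIN, not its check digit.

```python
import re
from datetime import datetime

# Shape check only: state code, PAN, entity code, "Z", check character.
GSTIN_SHAPE = re.compile(r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")


def curate(raw_rows):
    """Validate, de-duplicate and standardise rows; return (curated, rejected)."""
    curated, rejected, seen = [], [], set()
    for row in raw_rows:
        gstin = str(row.get("gstin", "")).strip().upper()
        inv_no = str(row.get("inv_no", "")).strip()
        try:
            # Assumed source format dd-mm-yyyy, standardised to ISO 8601.
            inv_date = datetime.strptime(row["inv_date"], "%d-%m-%Y").date().isoformat()
            itc_inr = round(float(row["itc"]), 2)
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
            continue
        if not GSTIN_SHAPE.match(gstin) or not inv_no:
            rejected.append(row)
            continue
        if (gstin, inv_no) in seen:
            continue                      # drop exact business-key duplicates
        seen.add((gstin, inv_no))
        curated.append({"gstin": gstin, "inv_no": inv_no, "inv_date": inv_date, "itc_inr": itc_inr})
    return curated, rejected


sample = [
    {"gstin": "27abcde1234f1z5", "inv_no": "INV-001", "inv_date": "05-04-2026", "itc": "1800.50"},
    {"gstin": "27ABCDE1234F1Z5", "inv_no": "INV-001", "inv_date": "05-04-2026", "itc": "1800.50"},
    {"gstin": "BADGSTIN", "inv_no": "INV-003", "inv_date": "06-04-2026", "itc": "10"},
]
curated_rows, rejected_rows = curate(sample)
print(curated_rows, len(rejected_rows))
```

Rejected rows are returned rather than silently dropped, so the pipeline can write them to a quarantine location and keep lineage intact.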

Zone 3 — Consumption

Business-facing datasets, aggregated views, BI-ready semantic layers and ML feature stores live here. Power BI connects via Synapse or Fabric. A vendor risk model reads feature vectors from this zone. Nothing in consumption is source-of-truth — disputed figures trace back through lineage to the landing file.

Governance Layer (Horizontal Across All Zones)

A data catalog — Microsoft Purview, Apache Atlas or AWS Glue Data Catalog — registers every dataset with an owner, sensitivity classification, retention date and business description. Access control enforced at the catalog level means a data scientist without finance clearance cannot read salary data in the curated zone even if they know the file path. Every read is logged for audit.
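As a rough illustration of catalog-level enforcement, the sketch below models a catalog entry and an access check in plain Python. The CatalogEntry fields mirror the attributes listed above; the dataset names, roles and the authorise_read function are hypothetical and do not represent the API of Purview, Atlas or Glue.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    owner: str
    sensitivity: str        # e.g. "finance" or "personal" (illustrative labels)
    retain_until: date
    description: str


CATALOG = {
    "curated/gstr2b": CatalogEntry("finance-ops", "finance", date(2034, 3, 31), "Validated GSTR-2B lines"),
    "curated/payroll": CatalogEntry("hr-ops", "personal", date(2034, 3, 31), "Monthly salary register"),
}
ACCESS_LOG = []


def authorise_read(user, clearances, dataset):
    """Allow a read only if the dataset is registered and the user holds its clearance."""
    entry = CATALOG.get(dataset)
    allowed = entry is not None and entry.sensitivity in clearances
    # Every attempt is logged, whether or not it succeeds.
    ACCESS_LOG.append({"user": user, "dataset": dataset, "allowed": allowed})
    return allowed


print(authorise_read("data.scientist", {"finance"}, "curated/payroll"))
print(authorise_read("payroll.analyst", {"personal"}, "curated/payroll"))
```

Knowing the path is not enough: the decision depends on the catalog entry, and an unregistered dataset is denied by default.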

Governing a Data Lake Under the DPDP Act, 2023

The DPDP Act and the rules notified in 2025 apply to any processing of digital personal data about Indian data principals — employees, customers, vendor contacts — and all of these flow into most enterprise data lakes.

Lawful basis and purpose limitation. Every dataset containing personal data must be tagged in the catalog with its processing purpose. Data collected for GST compliance cannot be repurposed for marketing analytics without a separate, documented consent basis. Your access controls must enforce this separation technically, not just as a policy document.

Column-level masking in the curated zone. The raw landing zone will contain full PII from source systems — Aadhaar numbers, PAN numbers, mobile numbers. The curated zone must apply masking or tokenisation before personal data reaches analysts. Only privileged roles with a documented business need access the unmasked landing data, and every such access is logged.
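One possible shape for that masking step is sketched below using Python's standard hmac and hashlib modules. The key, the field names and the choice to keep the last four characters visible are illustrative assumptions; the identifiers in the sample row are made up.

```python
import hashlib
import hmac

# Placeholder only: in practice the key would come from a secrets manager.
TOKEN_KEY = b"replace-with-key-from-a-secrets-manager"


def tokenise(value):
    """Keyed, deterministic token: the same input always maps to the same token."""
    return hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_tail(value, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    return "*" * max(len(value) - visible, 0) + value[-visible:]


def to_curated(raw_row):
    """Build the curated-zone row; unmasked identifiers never leave landing."""
    return {
        "person_token": tokenise(raw_row["aadhaar"]),
        "pan_masked": mask_tail(raw_row["pan"]),
        "mobile_masked": mask_tail(raw_row["mobile"]),
        "net_pay_inr": raw_row["net_pay_inr"],
    }


raw = {"aadhaar": "123412341234", "pan": "ABCDE1234F", "mobile": "9876543210", "net_pay_inr": 85000}
curated = to_curated(raw)
print(curated)
```

Tokenising with a keyed hash keeps joins possible across curated tables (the token is stable) without exposing the underlying number to analysts.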

Data principal rights: plan for erasure. Under the DPDP Act, a data principal can request deletion of their personal data. In a traditional warehouse this is a DELETE statement. In a lake you need either a soft delete with a catalog flag (marking data as erased and excluding it from all queries) or a hard delete using Delta Lake or Apache Iceberg's row-level delete capability, which rewrites the affected Parquet files. A hard delete is complete only once the superseded files are also removed by a vacuum or snapshot-expiry run, because time-travel history otherwise still holds the erased rows. Build this from day one. Retrofitting it across 3 years of existing files is expensive and risky.
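The two erasure options can be sketched in a few lines. This is plain Python over in-memory rows, with a hypothetical principal_id column and register; it shows the logic only, not the Delta Lake or Iceberg commands that would perform the rewrite on real tables.

```python
from datetime import date

ERASURE_REGISTER = {}   # principal_id -> date the erasure request was accepted


def request_erasure(principal_id, accepted_on):
    """Record the request so that there is an auditable trail of what was erased and when."""
    ERASURE_REGISTER[principal_id] = accepted_on


def soft_delete_view(rows):
    """Soft delete: a query-time filter so erased principals never reach a consumer."""
    return [row for row in rows if row["principal_id"] not in ERASURE_REGISTER]


def hard_delete(rows):
    """Hard delete: rewrite the dataset without erased rows and report the count purged."""
    kept = soft_delete_view(rows)
    return kept, len(rows) - len(kept)


rows = [
    {"principal_id": "P-001", "city": "Pune"},
    {"principal_id": "P-002", "city": "Nashik"},
    {"principal_id": "P-001", "city": "Mumbai"},
]
request_erasure("P-001", date(2026, 5, 2))
rewritten, purged = hard_delete(rows)
print(rewritten, purged)
```

The soft delete is cheap but only as strong as the rule that every reader goes through the filter; the hard delete is the one that actually removes the data.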

Penalties are material. The Data Protection Board can impose penalties up to Rs. 250 crore for a significant data breach attributable to failure to implement reasonable security safeguards. A lake that logs every access, masks PII at the curated zone and enforces purpose-based access control is a demonstrably stronger compliance position than an ungoverned file share.

Navigating the retention conflict. The DPDP Act requires personal data to be deleted once the processing purpose is fulfilled, but GST and Companies Act require transaction records for years. The practical resolution: retain the financial transaction record — GSTIN, invoice amount, tax amount — for the statutory period, but purge the associated personal identifiers (an individual's name on a contract invoice) once the compliance purpose is met, unless that identifier is itself a legal requirement of the record.

Worked Example: GST Reconciliation Lake for a Mid-Size Manufacturer

Company profile: Auto-components manufacturer, Pune. Annual turnover Rs. 200 crore. Approximately 4,800 e-invoices filed per month through IRP. Approximately 320 vendor purchase invoices claimed as ITC each month.

The problem before the lake: Three finance team members spent approximately 40 hours per month on ITC reconciliation in Excel — downloading GSTR-2B from the GST portal, exporting the purchase ledger from SAP and VLOOKUP-matching on invoice number. In FY 2025-26, an undetected GSTR-2B mismatch of Rs. 22 lakh — supplier had not filed their GSTR-1 in time — resulted in an ITC reversal notice from the department. Interest under Section 50 CGST Act at 18% per annum on Rs. 22 lakh for 8 months = Rs. 2,64,000. A Section 125 penalty in the adjudication order added Rs. 50,000. Total cost of one undetected mismatch: Rs. 3,14,000.
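The cost of that one mismatch can be reproduced with simple-interest arithmetic. The sketch follows the month-based calculation used above (actual demands are typically worked out on days of delay), and uses Decimal to avoid floating-point rounding on rupee amounts.

```python
from decimal import Decimal


def simple_interest(principal, annual_rate, months):
    """Simple interest for a whole number of months at a yearly rate."""
    return principal * annual_rate * months / Decimal(12)


itc_reversed = Decimal("2200000")                  # Rs. 22 lakh
interest = simple_interest(itc_reversed, Decimal("0.18"), Decimal(8))
penalty = Decimal("50000")                         # figure taken from the example above
total_cost = interest + penalty
print(interest, total_cost)
```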

Lake solution implemented — step by step:

GSTN API integration lands GSTR-1, GSTR-2B and GSTR-3B JSON into the landing zone within 24 hours of portal availability each month.
SAP exports (purchase ledger, sales ledger) land as CSV files daily via a scheduled SFTP pull.
An Apache Spark job in the curated zone joins both datasets on the composite key of supplier GSTIN + invoice number + invoice date, writing a reconciliation Parquet table with a status column: MATCHED, ITC_CLAIMED_NOT_IN_2B or IN_2B_NOT_CLAIMED.
A Power BI report built on the consumption zone surfaces all ITC_CLAIMED_NOT_IN_2B rows to the finance team within 48 hours of GSTR-2B generation — typically by the 15th of the month following the tax period.
The team resolves mismatches by following up with suppliers (who re-file or issue credit notes) before the 20th, the GSTR-3B due date.
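The matching logic in the third step can be sketched without Spark. The version below is plain Python over small in-memory lists, assuming both sides have already been curated to the same column names (gstin, inv_no, inv_date); the sample invoices are invented for illustration.

```python
def reconcile(purchase_ledger, gstr2b):
    """Full outer join on (GSTIN, invoice number, invoice date) with a status per key."""
    def key(row):
        return (row["gstin"], row["inv_no"], row["inv_date"])

    ledger = {key(row): row for row in purchase_ledger}
    portal = {key(row): row for row in gstr2b}
    report = []
    for k in sorted(ledger.keys() | portal.keys()):
        if k in ledger and k in portal:
            status = "MATCHED"
        elif k in ledger:
            status = "ITC_CLAIMED_NOT_IN_2B"
        else:
            status = "IN_2B_NOT_CLAIMED"
        report.append({"gstin": k[0], "inv_no": k[1], "inv_date": k[2], "status": status})
    return report


ledger_rows = [
    {"gstin": "27ABCDE1234F1Z5", "inv_no": "INV-001", "inv_date": "2026-04-05"},
    {"gstin": "29PQRSX5678K1Z2", "inv_no": "INV-002", "inv_date": "2026-04-09"},
]
gstr2b_rows = [
    {"gstin": "27ABCDE1234F1Z5", "inv_no": "INV-001", "inv_date": "2026-04-05"},
    {"gstin": "07LMNOP4321Q1Z8", "inv_no": "INV-777", "inv_date": "2026-04-11"},
]
report = reconcile(ledger_rows, gstr2b_rows)
exceptions = [row for row in report if row["status"] == "ITC_CLAIMED_NOT_IN_2B"]
print(exceptions)
```

The exceptions list is what the Power BI report would surface; a production job would normally also compare taxable value and tax amounts on matched keys, which this sketch leaves out.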

Outcome: Reconciliation team effort drops from 40 hours to approximately 6 hours per month (3 hours exception handling, 3 hours supplier follow-up). In FY 2026-27 the company has filed four consecutive GSTR-3Bs with zero unresolved mismatches at the filing date.

Infrastructure cost at this scale: Azure ADLS Gen2 for 36 months of data (~800 GB hot + 2.4 TB cool): approximately Rs. 55,000 per annum. Azure Synapse or Databricks compute for monthly batch reconciliation runs: approximately Rs. 1,10,000 per annum. Total: roughly Rs. 1,65,000 per annum, against Rs. 3,14,000 from a single year's penalty exposure, plus roughly 400 hours (34 hours a month, or about 2.5 FTE-months) of manual work reclaimed annually.

Total Cost of Ownership: What the Cloud Bill Actually Looks Like

The most common sticker shock in data lake projects is not the storage bill but the query cost. Object storage appears attractively cheap until an analyst runs SELECT * FROM events on an unpartitioned, uncompressed 5 TB table at approximately Rs. 400 per TB scanned. That single query costs Rs. 2,000 and takes 4 minutes. Run it ten times a day and the monthly query bill for one analyst reaches roughly Rs. 6 lakh (Rs. 2,000 × 10 queries × 30 days).
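The query-bill arithmetic is worth keeping as a small calculator. The rate per TB scanned is the approximate figure quoted above; the 30-day month is an assumption.

```python
RATE_PER_TB_SCANNED = 400   # Rs., approximate figure from the text


def query_bill_rs(table_tb, runs_per_day, days=30):
    """Return (cost per full-table query, monthly cost) in rupees."""
    per_query = table_tb * RATE_PER_TB_SCANNED
    return per_query, per_query * runs_per_day * days


per_query, monthly = query_bill_rs(table_tb=5, runs_per_day=10)
print(per_query, monthly)
```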

Build for query cost from the first sprint:

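Partition tables by date (year → month → day) and by low-cardinality filter columns such as state code. For high-cardinality columns such as GSTIN, prefer sorting, bucketing or Z-order clustering within partitions, because one partition per GSTIN produces thousands of tiny files. A query for "all Maharashtra invoices in April 2026" then scans only the relevant partition, which might be around 1% of the data for around 1% of the cost.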
Use columnar compressed formats. Parquet with Snappy compression typically reduces file size 5–10× compared to raw JSON or CSV and query time 3–5× because the engine reads only the columns requested, not the full row.
Expire stale snapshots. Delta Lake and Iceberg maintain ACID transaction logs and time-travel history, which keep superseded data files on storage. Set a vacuum or snapshot-expiry policy (for example, retain 30 days of history for operational rollback) so that old files and snapshot metadata do not accumulate unnecessary storage or slow metadata operations.
Tier cold data. Data older than 3 years, accessed only for annual audits, moves to cool or archive storage at 40–80% lower unit cost. Audit queries can tolerate the added retrieval latency.
Tag every dataset with a business owner and track cost per business question. When the CFO asks why the cloud bill rose 40% in Q3, you need a dataset-level answer backed by cost-allocation tags — not a "cloud is expensive" shrug.
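Partition pruning, the first item in the list above, can be illustrated with Hive-style key=value paths. The partition layout and sizes below are invented; state code 27 is Maharashtra under the GST state-code scheme, and the rate per TB is the approximate figure used earlier in this section.

```python
# Hypothetical partition layout: path -> size in GB.
PARTITION_SIZES_GB = {
    "state=27/year=2026/month=04": 40,
    "state=27/year=2026/month=03": 38,
    "state=29/year=2026/month=04": 25,
    "state=07/year=2026/month=04": 22,
}
RATE_PER_TB_SCANNED = 400   # Rs., approximate figure from the text


def scanned_gb(filters):
    """Sum the sizes of partitions whose path matches every filter key=value."""
    total = 0
    for path, size_gb in PARTITION_SIZES_GB.items():
        keys = dict(segment.split("=") for segment in path.split("/"))
        if all(keys.get(name) == value for name, value in filters.items()):
            total += size_gb
    return total


def query_cost_rs(filters):
    """Cost of a query that can skip every non-matching partition."""
    return scanned_gb(filters) * RATE_PER_TB_SCANNED / 1000


full_scan = query_cost_rs({})
pruned = query_cost_rs({"state": "27", "year": "2026", "month": "04"})
print(full_scan, pruned)
```

The engine never opens the partitions that cannot match, which is why a filter on a partition column is so much cheaper than the same filter on an ordinary column.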

A well-governed lake for a Rs. 200 crore turnover company typically runs Rs. 15–30 lakh per annum all-in, including storage, compute and catalog licensing. Equivalent capability on-premises — Oracle or SQL Server licensing, hardware, DBA headcount, power and cooling — rarely comes in below Rs. 60–80 lakh per annum at the same data volume. The savings are real; they require governed architecture to materialise.

Common Pitfalls and How to Fix Them

No catalog from day one. Teams say "we'll add metadata later." They never do. Within six months, 400 files exist and nobody knows which is the current version. Fix: Make catalog registration a deployment gate. No dataset lands without owner, sensitivity classification and retention date recorded in the catalog. Automate this check in your CI/CD pipeline.
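A deployment gate of this kind can be a very small check. The sketch below assumes each dataset ships with a manifest (a simple dictionary here) and that the CI job fails when the function returns a non-zero exit code; the manifest format and function names are hypothetical.

```python
REQUIRED_FIELDS = ("owner", "sensitivity", "retain_until")


def missing_fields(manifest):
    """List required catalog fields that are absent or blank in a manifest."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = manifest.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def catalog_gate(manifests):
    """Return (exit_code, problems); a CI job would fail the deployment on exit_code 1."""
    problems = {}
    for dataset, manifest in manifests.items():
        missing = missing_fields(manifest)
        if missing:
            problems[dataset] = missing
    return (1 if problems else 0), problems


manifests = {
    "curated/gstr2b": {"owner": "finance-ops", "sensitivity": "finance", "retain_until": "2034-03-31"},
    "curated/vendor_contacts": {"owner": "procurement", "sensitivity": ""},
}
exit_code, problems = catalog_gate(manifests)
print(exit_code, problems)
```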

Landing zone treated as an analytical layer. Analysts query raw JSON directly from the landing zone because the curated zone isn't ready yet. Dirty data propagates into dashboards and reports. Fix: Block direct query access to the landing zone for all roles except ingestion pipelines and designated auditors. Curated is the first read-permitted layer.

DPDP deferred to phase two. PII masking is retrofitted after an internal audit or a near-miss incident. Rewriting three years of Parquet files is expensive and operationally risky. Fix: Apply masking and tokenisation in the curated-zone transformation pipeline from the first sprint. It is a one-time design decision, not an ongoing burden.

Exotic formats and tools the team cannot operate. A startup adopts Apache Hudi after a conference talk and then cannot hire anyone who knows it. Fix: Parquet + Delta Lake or Iceberg are mainstream, well-documented and natively supported on every major cloud. Choose boring and operable over exciting and opaque — boring platforms beat exciting ones in every quarterly operational review.

No retention enforcement job. GST data from 2017 sits alongside current data because the deletion pipeline was never built. When DPDP rights requests arrive, there is no auditable record of what was purged and when. Fix: Treat automated retention enforcement as a first-class pipeline with its own schedule, logging and alerting — not a manual task delegated to whoever has time.
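A retention job boils down to a dated decision per object plus a log of what was decided. The sketch below applies the rule of thumb from the landing-zone section (archive after 3 years, purge after 8) and, as a simplifying assumption, counts from the landing date; a production job would count from the statutory trigger, such as the annual return due date for GST records.

```python
from datetime import date

ARCHIVE_AFTER_YEARS = 3   # tiering rule of thumb from the text
PURGE_AFTER_YEARS = 8     # strictest retention period cited in the text


def add_years(d, years):
    """Add calendar years, mapping 29 February to 28 February when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def retention_action(landed_on, today):
    """Decide what the retention job should do with one object."""
    if today >= add_years(landed_on, PURGE_AFTER_YEARS):
        return "PURGE"
    if today >= add_years(landed_on, ARCHIVE_AFTER_YEARS):
        return "ARCHIVE"
    return "KEEP_HOT"


def enforce(objects, today):
    """Return an auditable log of one decision per object path."""
    return [
        {"path": path, "action": retention_action(landed_on, today), "run_date": today.isoformat()}
        for path, landed_on in sorted(objects.items())
    ]


objects = {
    "landing/gstn/2017/07/gstr1.json": date(2017, 7, 1),
    "landing/gstn/2022/01/gstr2b.json": date(2022, 1, 1),
    "landing/gstn/2025/06/gstr2b.json": date(2025, 6, 1),
}
audit_log = enforce(objects, today=date(2026, 4, 1))
print(audit_log)
```

The log, not the deletion itself, is what answers an auditor's question about what was purged and when.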

Cost governance absent until the first bill arrives. Fix: Tag every resource on day one, set budget alerts at 80% and 100% of the monthly plan, and review cost-per-dataset in every sprint retrospective. Cost accountability is a cultural norm that must be established early, not rescued later.

Aligning with India's Digital Public Infrastructure

India's Digital Public Infrastructure (DPI) stack — GSTN, MCA V3, Account Aggregator, ONDC, OCEN, DigiLocker — is designed around open APIs and structured JSON. Each is a natural data lake source.

The Account Aggregator framework, governed under RBI guidelines, enables businesses to receive consented financial data from customers and counterparties — bank statements, GST returns, ITR acknowledgements — in structured JSON. Ingesting AA data into a lake alongside internal ERP data creates a vendor-assessment or credit-underwriting dataset of a kind that no traditional warehouse was designed to absorb cleanly.

ONDC order flows generate structured events: catalog queries, order placements, fulfilment updates, returns. In the lake these feed demand-forecasting models and reconcile against ERP inventory movements. MCA V3 API data enables company-level analytics — tracking filings, identifying compliance status, monitoring related-party structures — at a depth that manual downloads to Excel cannot match.

Architect with open formats — Parquet, Delta, Iceberg — and avoid vendor-specific extensions that limit portability. The DPI ecosystem will evolve; your lake must absorb new schemas without a multi-month migration project each time. That is the compounding advantage of schema-on-read: today's GSTN schema and tomorrow's revised e-invoice fields land in the same physical store without a redevelopment cycle.

Key Takeaways

A data lake stores data in its native format and applies schema at read time, making it suitable for the full range of Indian enterprise data: structured GST and ERP exports, semi-structured API JSON, and unstructured contracts, emails and call recordings.
Statutory retention obligations under the GST Rules (72 months), Income Tax Act (6 years) and Companies Act (8 years) make tiered object storage the economically and legally rational choice over keeping aged data in live relational tables.
The three-zone model — landing (raw and immutable), curated (trusted and cleansed) and consumption (business-ready) — prevents the data swamp failure mode when enforced through catalog-gated pipeline deployment from day one.
DPDP 2023 requires purpose tagging, column-level masking, erasure capability and access logging in any architecture that holds personal data; retrofitting these across years of existing files is far more expensive than building them into zone-one design.
Query cost, not storage cost, is the primary TCO risk: partition tables by date and state code, cluster on GSTIN, compress to Parquet, expire snapshots and tag every dataset with a cost-allocation owner from the first deployment.
A GST reconciliation lake at mid-market scale can cost under Rs. 2 lakh per annum to run while eliminating six-figure interest and penalty exposure and recovering 30–35 hours of monthly manual reconciliation effort.
India's DPI ecosystem emits evolving, schema-flexible JSON from GSTN, MCA V3, Account Aggregator and ONDC; open formats and schema-on-read make the lake the natural convergence point for this data, with analytic value compounding as each new integration adds context to the existing store.


Mayank Wadhera
