Legal Suvidha

Data Lake Architecture

A 2026 data lake architecture for Indian enterprises uses a tiered bronze, silver and gold model on an Indian-region object store with decoupled elastic compute. Bronze holds raw ingests from ERP, GST and e-invoicing systems; silver cleanses and conforms; gold delivers business metrics. Open table formats like Delta, Iceberg or Hudi enable multiple engines, while column encryption, tokenisation, lineage tracking and retention policies satisfy DPDP and Companies Act audit requirements.

Mayank Wadhera
Published: 5 Jul 2023
Updated: 23 May 2026

Design a 2026-ready data lake architecture for Indian enterprises balancing AI workloads, GST analytics, DPDP compliance and audit-trail retention.

Data Lake Architecture for Indian Enterprises: A 2026-Ready Design Guide

A data lake stores raw data in object storage decoupled from compute, then organises it through bronze (raw), silver (cleansed) and gold (aggregated) layers. For Indian enterprises in FY 2026-27, the right architecture lets a single platform retain ten years of GST returns, e-invoices and IoT streams affordably, enforce Digital Personal Data Protection (DPDP) Act 2023-grade access controls on PAN and Aadhaar data, and simultaneously power BI dashboards, ML models and GenAI copilots — without replicating data into separate silos.


Why Indian Enterprises Need a Data Lake in 2026

The regulatory data burden on Indian businesses has compounded sharply. A GST-registered manufacturer with a monthly e-invoicing volume of 10,000 documents generates roughly 240 MB of structured JSON data per year from invoices alone (assuming about 2 KB per document), before adding e-way bills, GSTR-1 reconciliation files, credit notes and ITC registers. Add ERP transaction logs, bank statement feeds, factory-floor IoT sensor streams and customer interaction data, and a Rs. 300-crore revenue company routinely accumulates 5–15 TB of operational data per year.

Traditional on-premises data warehouses were not designed for this. They enforce schema-on-write, making it expensive to ingest unstructured or semi-structured data. Storage and compute are coupled, so the database server runs at full cost even during overnight idle hours. Retention beyond three to four years becomes a capacity and licensing headache — precisely when the Companies Act 2013 and CGST Rules 2017 require you to hold data for six to eight years.

A data lake solves these problems structurally. Object storage in an Indian cloud region — AWS ap-south-1 (Mumbai), ap-south-2 (Hyderabad), Azure Central India, or GCP asia-south1 (Mumbai) — costs a fraction of database block storage, accepts any file format, and scales to petabytes without a procurement cycle. Compute is provisioned on demand: a Spark cluster spins up for a quarterly audit data pull and shuts down 90 minutes later, billing only for what it used.

The lakehouse pattern extends this by layering ACID transaction guarantees and SQL semantics on top of object storage, using open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi. Finance teams who need consistent, point-in-time snapshots of the general ledger get them. Data scientists who need the same raw events for model training get them too. One storage layer, multiple consumers, no duplication.


The Bronze-Silver-Gold-Semantic Model in Detail

The medallion architecture (bronze, silver, gold, semantic) is more than a naming convention. Each layer has distinct ownership, quality contracts and access patterns. Getting the boundaries wrong is a common reason enterprise data lake projects fail.

Bronze: The Immutable Audit Record

Bronze is the raw landing zone. Every source system writes to bronze exactly as it produced the data — no transformations, no corrections, no schema enforcement. This preserves your audit trail in its original form.

For Indian compliance, bronze must retain:

  • GST data: GSTR-1, GSTR-2B, GSTR-3B filings in original JSON/CSV; e-invoice IRN payloads from the IRP (Invoice Registration Portal); e-way bill logs
  • ERP extracts: Full journal entry tables, AP/AR ageing, fixed-asset registers — timestamped daily or on every change event
  • Banking feeds: Bank statement files (MT940, CSV, PDF-to-text) from all operating accounts
  • Payroll: PF contribution files, TDS deduction registers (Form 24Q source data)
  • External regulatory correspondence: GST portal notices, Income Tax demand orders, MCA V3 filings — stored as PDFs alongside structured metadata

Retention policy on bronze: Under Section 128 of the Companies Act 2013, books of account must be maintained for eight years from the end of the relevant financial year. Under Section 36 of the CGST Act 2017 (read with Rule 56 of the CGST Rules 2017), GST records must be retained for 72 months (six years) from the due date of the annual return. Your bronze retention policy must satisfy the stricter of these, which is effectively eight years for most records. Configure object-storage lifecycle rules to move bronze data to a low-cost archive tier after 18 months; keep it retrievable on demand at cold-storage pricing.
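The "stricter of the two" rule can be sketched as a small date calculation. This is a minimal illustration, not legal advice: the function names are hypothetical, and the annual-return due date in the example is an assumed input rather than a statement of the actual statutory date.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years; 29 Feb falls back to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def bronze_retain_until(fy_end: date, gst_annual_return_due: date) -> date:
    """Return the later of the two retention horizons described in the text."""
    companies_act_until = add_years(fy_end, 8)       # eight years from FY end
    gst_until = add_years(gst_annual_return_due, 6)  # 72 months from due date
    return max(companies_act_until, gst_until)


# Hypothetical inputs: FY 2026-27 ends on 31 Mar 2027; the due date below
# is an assumption chosen only to make the example concrete.
print(bronze_retain_until(date(2027, 3, 31), date(2027, 12, 31)))
```

Storing the computed date as object metadata at ingestion time is one way to let lifecycle and object-lock settings be derived per record rather than per bucket.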

Silver: Curated, Conformed Entities

Silver applies business rules to bronze data to produce clean, deduplicated, standardised entities. A silver customer entity merges CRM records, GST registration data (GSTIN lookup), and e-invoice buyer fields into a single master with a unique internal ID. A silver invoice entity standardises tax-code classifications across different ERP versions or acquired-company systems.

This is the layer that finance and tax teams use for reconciliation queries. "Show me all invoices where GST was charged at 18% but the HSN code maps to a 12% rate" — that is a silver-layer query. Every silver entity should have a documented schema, a nominated data owner, and a freshness SLA.
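The rate-mismatch question above can be sketched as a join between a silver invoice entity and an HSN rate master. The example below uses Python's built-in sqlite3 purely so it runs anywhere; the table names, column names, HSN codes and rates are all hypothetical, and a real lakehouse would run equivalent SQL through its own query engine.

```python
import sqlite3

# Invoices whose charged rate differs from the rate the HSN master expects.
MISMATCH_SQL = """
    SELECT i.irn, i.hsn_code, i.gst_rate_charged, h.expected_rate
    FROM silver_invoice AS i
    JOIN hsn_rate_master AS h ON h.hsn_code = i.hsn_code
    WHERE i.gst_rate_charged <> h.expected_rate
    ORDER BY i.irn
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for two silver tables (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE silver_invoice (
            irn TEXT PRIMARY KEY, hsn_code TEXT, gst_rate_charged INTEGER);
        CREATE TABLE hsn_rate_master (
            hsn_code TEXT PRIMARY KEY, expected_rate INTEGER);
        INSERT INTO hsn_rate_master VALUES ('HSN-A', 12), ('HSN-B', 18);
        INSERT INTO silver_invoice VALUES
            ('IRN-001', 'HSN-A', 12),
            ('IRN-002', 'HSN-A', 18),
            ('IRN-003', 'HSN-B', 18);
    """)
    return conn


def find_rate_mismatches(conn: sqlite3.Connection) -> list:
    """Return (irn, hsn_code, charged, expected) rows that disagree."""
    return conn.execute(MISMATCH_SQL).fetchall()


print(find_rate_mismatches(build_demo_db()))
```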

Gold: Metric-Ready Aggregates

Gold tables are business-domain aggregates designed for fast read performance. Examples:

  • daily_revenue_by_channel — net revenue after GST, by sales channel, by day
  • gst_liability_monthly — output tax, ITC, net payable, by GSTIN, by period
  • working_capital_weekly — trade debtors, creditors and inventory in Rs. by business unit

Gold tables are rebuilt from silver on a defined schedule (typically daily for financial metrics, hourly for operational KPIs). They are derived views, not systems of record. If a gold table produces a wrong number, fix the silver transformation and recompute — never patch the gold aggregate directly.
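A minimal sketch of the "recompute, never patch" rule: the gold table is a pure function of silver rows, so a fix in silver flows through on the next rebuild. The row layout and the simplified net-payable formula (output tax minus ITC) are assumptions for illustration only.

```python
from collections import defaultdict


def build_gst_liability_monthly(silver_rows: list) -> list:
    """Rebuild the whole gold aggregate from silver rows on every run."""
    totals = defaultdict(lambda: {"output_tax": 0, "itc": 0})
    for row in silver_rows:
        key = (row["gstin"], row["period"])
        totals[key]["output_tax"] += row["output_tax"]
        totals[key]["itc"] += row["itc"]
    return [
        {
            "gstin": gstin,
            "period": period,
            "output_tax": t["output_tax"],
            "itc": t["itc"],
            # Simplified: ignores carry-forward, reversals and cash ledger.
            "net_payable": t["output_tax"] - t["itc"],
        }
        for (gstin, period), t in sorted(totals.items())
    ]


# Hypothetical silver rows; GSTIN values are placeholders, not real numbers.
silver = [
    {"gstin": "GSTIN-1", "period": "2026-04", "output_tax": 1000, "itc": 300},
    {"gstin": "GSTIN-1", "period": "2026-04", "output_tax": 500, "itc": 100},
    {"gstin": "GSTIN-2", "period": "2026-04", "output_tax": 200, "itc": 50},
]
print(build_gst_liability_monthly(silver))
```

Because the function has no state of its own, correcting a silver row and rerunning it is the only way to change a gold number, which is the behaviour the paragraph above asks for.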

Semantic Layer

Above gold sits a governed catalog of named, calculated metrics with agreed definitions. Tools such as dbt Semantic Layer, Looker, or Cube handle this. The semantic layer ensures that "revenue" in the CFO's dashboard and "revenue" in the sales team's report are computed identically. Without this layer, every team defines its own metrics and the organisation spends its analytics budget arguing about numbers rather than acting on them.


Storage, Compute and Open Table Format Choices

| Format | Best for | Key strength | Typical Indian stack |
|---|---|---|---|
| Delta Lake | Databricks-heavy shops | Time travel, Z-ordering, strong DML | Databricks on AWS India |
| Apache Iceberg | Multi-engine environments | Athena, Trino, Flink, Spark all read natively | AWS-native stacks with Lake Formation |
| Apache Hudi | High-frequency CDC/upserts | Near-real-time, incremental processing | Banking and fintech CDC pipelines |

All three formats support ACID transactions, schema evolution, and time travel — meaning you can query your data as of a specific historical date. For audit and litigation-hold scenarios, this is invaluable: you do not need to restore from a backup; you query the table's history directly using a VERSION AS OF or TIMESTAMP AS OF clause.
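As a rough sketch, the helper below builds the two time-travel query shapes named above. The function and table names are hypothetical, the table identifier is assumed to come from trusted configuration rather than user input, and the exact clause support varies by engine and table format, so check your engine's documentation before relying on it.

```python
from typing import Optional


def time_travel_sql(table: str, *, version: Optional[int] = None,
                    timestamp: Optional[str] = None) -> str:
    """Build a point-in-time SELECT using VERSION AS OF or TIMESTAMP AS OF."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Hypothetical table name; the date is the close of a financial year.
print(time_travel_sql("silver.general_ledger", timestamp="2026-03-31"))
print(time_travel_sql("silver.general_ledger", version=42))
```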

For compute architecture, decouple interactive and batch workloads from the start:

  • Batch ETL (bronze → silver → gold): Scheduled Spark jobs on EMR, Databricks Jobs, or Azure Synapse Pipelines
  • Interactive BI queries: Serverless query engines — Amazon Athena, Trino, or Databricks SQL Warehouse — pay-per-query
  • ML training: Isolated GPU clusters provisioned on demand; never run on shared ETL infrastructure
  • Ad hoc analyst queries: Notebook environments with auto-terminating clusters

DPDP Compliance and Data Governance on Your Lake

The Digital Personal Data Protection Act 2023 (DPDP Act) makes any entity that determines the purpose and means of processing personal data a data fiduciary. If your lake holds customer names, mobile numbers, email addresses, PAN, Aadhaar numbers, bank account details, or IP addresses — and it almost certainly does — you carry these obligations.

Four architecture decisions the DPDP Act forces:

  1. Tokenise PII at ingestion. When an ERP event lands in bronze, replace PAN, Aadhaar, and bank account numbers with non-reversible tokens (or reversible vault tokens if you need to recover originals for legitimate use). Plain-text PAN in an S3 bucket is not acceptable under any interpretation of "reasonable security safeguards" under Section 8(5) of the Act.
  2. Column-level access controls. A finance analyst querying GST reconciliation should not have access to the same columns as a marketer running segmentation. Unity Catalog (Databricks), AWS Lake Formation, and Apache Ranger all support column-level and row-level security on top of open table formats.
  3. Data principal rights workflow. The Act grants individuals the right to correction and erasure of their data. Your lake must support selective deletion: Delta, Iceberg, and Hudi all support hard deletes through MERGE and DELETE operations, followed by a file-purge step (VACUUM in Delta Lake, snapshot expiry in Iceberg, cleaning in Hudi) to remove the underlying files. Design this workflow before you have 500 erasure requests in a queue.
  4. Tamper-proof audit logs. Under Section 8(6), a personal data breach must be notified to the Data Protection Board and affected individuals without undue delay. Your lake's access logs (who queried what, and when) must be retained and immutable. Use object-store versioning plus a WORM (Write Once Read Many) bucket policy on your audit-log prefix.
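The first decision, tokenising at ingestion, can be sketched with a keyed hash. This is one possible approach under stated assumptions, not a prescribed DPDP control: the field names and token prefix are hypothetical, the key must live in a secrets manager rather than in code, and a keyed hash yields non-reversible tokens only (reversible tokens need a vault service instead).

```python
import hashlib
import hmac

# Hypothetical names of PII columns in an incoming bronze record.
PII_FIELDS = ("pan", "aadhaar", "bank_account")


def tokenise_value(value: str, key: bytes) -> str:
    """Deterministic, non-reversible token: same input and key, same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:32]


def tokenise_record(record: dict, key: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by tokens."""
    out = dict(record)
    for field in PII_FIELDS:
        if out.get(field):
            out[field] = tokenise_value(str(out[field]), key)
    return out


# Placeholder key and record; never hard-code a real key like this.
demo_key = b"demo-key-from-secrets-manager"
raw = {"invoice_id": "INV-1", "pan": "ABCDE1234F", "amount": 1000}
print(tokenise_record(raw, demo_key))
```

Determinism is a deliberate trade-off here: it keeps joins on the tokenised column working in silver, but short identifiers remain guessable by anyone who obtains the key, so key custody matters as much as the hashing itself.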

Penalty context: The Schedule to the DPDP Act 2023 prescribes penalties of up to Rs. 250 crore for failure to take reasonable security safeguards, and up to Rs. 200 crore for failure to notify a data breach. The Data Protection Board is expected to become operational during FY 2026-27. Treat DPDP compliance as a hard architectural constraint, not a post-launch checklist item.


GST and Statutory Records: Retention Obligations by Record Type

When a GST officer issues a scrutiny notice or an income-tax assessment begins, the typical ask is: "Provide all invoices, credit notes, and ITC claims for [date range], reconciled against your GSTR-3B filings." If this data lives across six systems with no common invoice key, your team spends two weeks assembling it. If it lives in a well-designed silver layer keyed on IRN, the same query runs in 20 minutes.

| Record type | Governing provision | Minimum retention |
|---|---|---|
| Books of accounts | Companies Act 2013, Sec 128 | 8 years from end of relevant FY |
| GST records and accounts | CGST Act 2017, Sec 36 (with CGST Rules 2017, Rule 56) | 72 months from due date of annual return |
| E-invoices (IRN payload) | GST advisory + Companies Act | 8 years (align to stricter) |
| TDS certificates (Form 16/16A) | Income Tax Act 1961 | 7 years from relevant AY |
| Payroll and PF contribution registers | EPF & MP Act 1952 | 5 years (longer for PF challans) |
| Bank statements | RBI guidelines + Companies Act | 8 years |

Set your bronze lifecycle policies to match this table. Archive to Glacier Instant Retrieval or equivalent after 18 months of active use; move to Deep Archive / cold tier for years 5–8. Enable object lock with compliance mode on the archive prefix — this prevents both accidental deletion and a rogue admin override.
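The tiering described above might be expressed as an S3 lifecycle configuration along these lines. The rule ID, prefix and day counts are assumptions (18 months is approximated as 548 days, and "years 5–8" is read as starting after four full years); the dictionary follows the shape boto3 accepts in `put_bucket_lifecycle_configuration`, but the sketch only builds and prints it and deliberately defines no expiration rule.

```python
import json


def bronze_lifecycle_rules(prefix: str = "bronze/",
                           archive_after_days: int = 548,
                           deep_archive_after_days: int = 4 * 365) -> dict:
    """Build a lifecycle configuration that tiers bronze data, never deletes."""
    return {
        "Rules": [
            {
                "ID": "bronze-tiering",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # About 18 months of active use, then warm archive.
                    {"Days": archive_after_days, "StorageClass": "GLACIER_IR"},
                    # From year 5 onwards, cold compliance tier.
                    {"Days": deep_archive_after_days,
                     "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


print(json.dumps(bronze_lifecycle_rules(), indent=2))
```

Object lock is configured separately from lifecycle rules, so this sketch covers only the cost side of the paragraph, not the deletion protection.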


AI and Lakehouse Convergence: What to Plan For Now

Modern lakehouse platforms are converging compute for BI, ML and GenAI into one layer. For Indian enterprises, the near-term use cases are concrete:

  • GST risk scoring: Train a gradient-boosting model on historical ITC mismatches to flag high-risk claims before you file GSTR-3B
  • AR collection forecasting: ML models on silver-layer invoice-ageing tables to predict collection dates and optimise working-capital
  • Contract intelligence: Retrieval-augmented generation (RAG) over a vector index of vendor contracts — query "which contracts renew in the next 60 days with no auto-renewal clause?" in natural language
  • Customer support copilot: A GenAI assistant grounded in gold-layer CRM and order-history data, with every answer traceable to a source row

For all of these, data quality in silver matters more than model architecture. A RAG answer generated from inconsistently-defined silver data is worse than no answer — it is wrong with high confidence. Invest in silver-layer quality gates and a semantic layer before you select a model.

Plan for vector indexes as a first-class storage type in your lake. Databricks Vector Search, Amazon OpenSearch Service, and pgvector on Amazon RDS are typical choices. Store embeddings adjacent to source documents in the lake so lineage is maintained: you can trace a copilot answer back to the specific e-invoice or contract that produced it.


Common Pitfalls — and How to Fix Them

Bronze Becomes a Swamp

Teams land files in bronze but never build silver transformations. After 18 months, bronze has 40 TB of data with no catalog, no schema documentation, and no data owner. Queries take hours; nobody trusts the output.

Fix: Every new source added to bronze must have a corresponding silver transformation committed within the same sprint. No bronze source goes live without a documented schema, a named data owner, and a silver consumer. Treat this as a deployment gate, not a guideline.
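A deployment gate of this kind can be as small as a manifest check in CI. The manifest keys below are hypothetical names for the three requirements listed in the fix; a real pipeline would read them from whatever catalog or config format the team already uses.

```python
# Hypothetical manifest keys for the three requirements named in the fix.
REQUIRED_KEYS = ("schema_doc", "data_owner", "silver_consumer")


def bronze_gate_failures(manifest: dict) -> list:
    """Return the required keys that are missing or blank in the manifest."""
    return [
        key for key in REQUIRED_KEYS
        if not str(manifest.get(key) or "").strip()
    ]


# Made-up source manifest with no silver consumer declared yet.
candidate = {
    "source": "sap_journal_entries",
    "schema_doc": "docs/bronze/sap_journal_entries.md",
    "data_owner": "finance-data@example.com",
    "silver_consumer": "",
}
failures = bronze_gate_failures(candidate)
print("BLOCK deployment:" if failures else "OK to deploy", failures)
```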

Plain-Text PII in Object Storage

PAN numbers, mobile numbers and email addresses land in a CSV in S3 with only default SSE encryption. A misconfigured bucket policy or a compromised IAM key exposes millions of records and creates a penalty exposure of up to Rs. 250 crore.

Fix: Tokenise at the extraction layer — before data crosses the source system boundary. Use a vault service (HashiCorp Vault, AWS Secrets Manager with a tokenisation sidecar, or a dedicated microservice). Enforce via pipeline gate: any bronze pipeline writing columns classified as PII without a tokenised variant fails the pre-deployment policy check.
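The pre-deployment policy check could look roughly like the sketch below. The column descriptor format (name, classification, tokenised flag) is an assumption made for illustration; in practice the classification would come from your data catalog.

```python
def untokenised_pii_columns(columns: list) -> list:
    """Return names of PII-classified columns that are not tokenised."""
    return [
        col["name"]
        for col in columns
        if col.get("classification") == "pii" and not col.get("tokenised", False)
    ]


# Hypothetical column descriptors for one bronze pipeline.
pipeline_columns = [
    {"name": "invoice_id", "classification": "internal"},
    {"name": "buyer_pan", "classification": "pii", "tokenised": True},
    {"name": "buyer_mobile", "classification": "pii"},
]

violations = untokenised_pii_columns(pipeline_columns)
if violations:
    # A CI job would exit non-zero here to fail the deployment.
    print("Policy check failed for columns:", violations)
```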

Metric Definitions Owned by No One

The finance team computes "net revenue" net of GST and returns. The sales team computes it gross of GST. The product team includes services revenue that finance books separately. Three different numbers appear in the same board deck.

Fix: Implement a semantic layer with a governed metrics store (dbt Semantic Layer, Cube, or LookML). Every named metric has a single owner, a written definition linked to the relevant accounting standard or GST treatment, and a daily automated test. Definition changes go through a review that includes finance and tax.

Small-File Proliferation

A Kafka consumer writes one Parquet file per event to bronze. After a month, a single day's partition has 50,000 files of 200 KB each. Listing and scanning these files takes longer than the actual computation.

Fix: Schedule compaction operations nightly on recently-written partitions — Delta Lake's OPTIMIZE, Iceberg's rewrite_data_files, Hudi's clustering. Target output file sizes of 128 MB–512 MB. This single change can reduce query latency by 5–10× on high-frequency event sources.
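To see why compaction helps, the sketch below groups small files greedily until each group approaches a target size. It is a toy planner, not how OPTIMIZE, rewrite_data_files or Hudi clustering are implemented; the 256 MiB target is an assumed value inside the 128 MB–512 MB range mentioned above.

```python
def compaction_plan(file_sizes_bytes: list,
                    target_bytes: int = 256 * 1024**2) -> list:
    """Greedily group files so each group stays at or under the target size."""
    groups, current, current_size = [], [], 0
    for size in file_sizes_bytes:
        # Start a new output file when adding this one would overshoot.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# The partition from the example: 50,000 files of 200 KB each.
small_files = [200 * 1024] * 50_000
plan = compaction_plan(small_files)
print(f"{len(small_files)} input files -> {len(plan)} compacted files")
```

A query engine then lists and opens a few dozen files instead of tens of thousands, which is where most of the latency saving comes from.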

Compute Left Unmanaged

A data scientist runs a full-history training job on a 40-node cluster and forgets to terminate it over a long weekend. The monthly cloud bill jumps by Rs. 7–9 lakhs with no corresponding business output.

Fix: Tag every compute job with team, project, and cost centre at launch. Enforce auto-termination on all interactive clusters after 90 minutes of idle time. Produce a weekly chargeback report so business owners see their consumption and own their spend decisions.
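The two controls in this fix, mandatory tags and idle auto-termination, reduce to simple checks. Managed platforms usually expose both as cluster settings, so the sketch below only illustrates the logic; the tag names and helper functions are hypothetical.

```python
from datetime import datetime, timedelta

REQUIRED_TAGS = ("team", "project", "cost_centre")  # hypothetical tag keys
IDLE_LIMIT = timedelta(minutes=90)


def missing_tags(tags: dict) -> list:
    """Return required tags that are absent or empty at launch time."""
    return [tag for tag in REQUIRED_TAGS if not tags.get(tag)]


def should_terminate(last_activity: datetime, now: datetime,
                     idle_limit: timedelta = IDLE_LIMIT) -> bool:
    """True once a cluster has been idle for at least the idle limit."""
    return now - last_activity >= idle_limit


# Made-up cluster state: last query ran at 18:00, checked at 19:45.
last_seen = datetime(2026, 8, 14, 18, 0)
checked_at = datetime(2026, 8, 14, 19, 45)
print(missing_tags({"team": "data-science", "project": "itc-risk"}))
print(should_terminate(last_seen, checked_at))
```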


Worked Example: A Mid-Size Indian Manufacturer, FY 2026-27

Scenario: Bharat Precision Parts Pvt. Ltd. — a Pune-based auto-ancillary manufacturer, Rs. 450 crore annual revenue, 12,000 e-invoices per month, two GSTIN registrations, 800 employees, SAP ERP, one MES system generating IoT sensor data, and a legacy on-premises data warehouse approaching end of vendor support.

Annual data volumes (estimated):

  • E-invoices: 12,000/month × 2 KB JSON = 24 MB/month → 288 MB/year from GST alone
  • ERP journal entries: ~5,00,000 entries/year × 1 KB = 500 MB/year
  • IoT sensor data (10 machines × 1 event/second × 200 bytes): ~60 GB/year
  • Payroll, HR, banking feeds, scanned documents: ~20 GB/year
  • Total new data per year: ~80 GB structured + semi-structured
  • 10-year lake footprint: ~800 GB raw + derived tables ≈ ~3–5 TB total
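The volume estimates above can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 KB = 1,000 bytes) and a 365-day year; computed exactly, the IoT stream comes to about 63 GB and the annual total to about 84 GB, which the list rounds to ~60 GB and ~80 GB.

```python
KB, MB, GB = 1_000, 1_000**2, 1_000**3
SECONDS_PER_YEAR = 365 * 24 * 3600


def annual_volumes_bytes() -> dict:
    """Per-source annual data volume for the scenario, in bytes."""
    return {
        "e_invoices": 12_000 * 12 * 2 * KB,      # 12,000/month at 2 KB each
        "erp_journal": 500_000 * 1 * KB,         # 5,00,000 entries at 1 KB
        "iot": 10 * 1 * 200 * SECONDS_PER_YEAR,  # 10 machines, 1 event/s, 200 B
        "other_feeds": 20 * GB,                  # payroll, HR, banking, scans
    }


volumes = annual_volumes_bytes()
total = sum(volumes.values())
for name, size in volumes.items():
    print(f"{name:12s} {size / GB:8.3f} GB/year")
print(f"{'total':12s} {total / GB:8.3f} GB/year, {10 * total / GB:.0f} GB over 10 years")
```

The breakdown makes it clear that sensor data and documents dominate the footprint, while the GST and ERP records that drive compliance obligations are under 1 GB per year.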

Illustrative cloud storage cost (AWS ap-south-1, approximate FY 2026-27 rates):

  • S3 Standard — hot tier, last 18 months (~150 GB): ~Rs. 375/month
  • S3 Glacier Instant Retrieval — warm archive, years 2–8 (~800 GB): ~Rs. 200/month
  • S3 Glacier Deep Archive — compliance cold tier, years 8–10: ~Rs. 80/month
  • Total storage: ~Rs. 655/month — a rounding error in the IT budget

Where cost actually lands — compute:

  • Daily ETL pipeline (bronze → silver → gold), 2-node Spark cluster, 2 hours/day: ~Rs. 12,000/month
  • BI query engine (Athena, pay-per-query at current workload): ~Rs. 8,000/month
  • Quarterly audit data pull (10-node cluster, 4 hours): Rs. 3,200/quarter → ~Rs. 1,070/month amortised
  • Total compute: ~Rs. 21,000/month

Total lake running cost: ~Rs. 22,000/month (Rs. 2.64 lakh/year)

Compare this to the legacy warehouse renewal: a comparable appliance licence and hardware refresh was quoted at Rs. 18–22 lakh/year, excluding DBA salary. The lakehouse delivers more capability — ML-ready, GenAI-ready, DPDP-compliant — at roughly 85% lower running cost.
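As a quick check of the comparison, the sketch below re-adds the illustrative monthly figures and computes the saving against the quoted legacy range. All inputs are the article's approximate numbers; the quarterly audit pull is amortised by dividing Rs. 3,200 by three months (about Rs. 1,067).

```python
STORAGE_RS_PER_MONTH = {
    "s3_standard": 375,
    "glacier_instant_retrieval": 200,
    "glacier_deep_archive": 80,
}
COMPUTE_RS_PER_MONTH = {
    "daily_etl": 12_000,
    "bi_queries": 8_000,
    "quarterly_audit_pull": 3_200 / 3,  # Rs. 3,200 per quarter, amortised
}


def monthly_cost_rs() -> float:
    """Total illustrative running cost per month, storage plus compute."""
    return sum(STORAGE_RS_PER_MONTH.values()) + sum(COMPUTE_RS_PER_MONTH.values())


def saving_vs_legacy(legacy_lakh_per_year: float) -> float:
    """Fractional saving against a legacy annual cost given in Rs. lakh."""
    lake_lakh_per_year = monthly_cost_rs() * 12 / 100_000
    return 1 - lake_lakh_per_year / legacy_lakh_per_year


print(f"Monthly: Rs. {monthly_cost_rs():,.0f}")
print(f"Saving vs Rs. 18 lakh: {saving_vs_legacy(18):.1%}")
print(f"Saving vs Rs. 22 lakh: {saving_vs_legacy(22):.1%}")
```

On these inputs the saving lands between roughly 85% and 88% across the quoted range, consistent with the "roughly 85%" figure at the conservative end.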

DPDP compliance uplift (one-time build cost, approximate):

  • Tokenisation microservice: Rs. 8–12 lakh
  • Unity Catalog / Lake Formation access policy design: Rs. 3–5 lakh
  • Data Principal Rights workflow (erasure + correction APIs): Rs. 2–4 lakh
  • Total DPDP uplift: Rs. 13–21 lakh one-time

With penalties for inadequate security safeguards at up to Rs. 250 crore under the DPDP Act, a Rs. 15-lakh investment in compliance architecture has an immediate, asymmetric return.


Migration from a Legacy Warehouse: A Staged Approach

Do not attempt a big-bang migration. The cost of a failed migration — corrupted historical data, missed audit requests, broken BI reports during year-end close — far exceeds the cost of running parallel systems for two quarters.

Recommended five-stage migration:

  1. Months 1–2: Foundation. Set up object storage with DPDP-grade access controls, the metadata catalog, and CI/CD pipelines for data pipelines. Ingest one non-critical source (e.g., web logs or IoT) end-to-end to validate architecture decisions.
  2. Months 3–4: Compliance data first. Migrate GST data and ERP journal entries to bronze and silver. Run reconciliation queries against both old and new systems in parallel. Finance sign-off required before proceeding.
  3. Months 5–6: Gold layer and BI. Build gold tables. Connect BI tools to the lakehouse. Run dashboards side-by-side. Identify discrepancies; fix silver transformations, not gold patches.
  4. Months 7–8: Decommission. Once business stakeholders are confident across two consecutive month-end closes, switch BI tools to the lakehouse as primary. Freeze the old warehouse read-only for 90 days, then decommission.
  5. Month 9 onwards: AI workloads. With a clean, governed lake, begin ML and RAG use cases on the same infrastructure without new data-movement pipelines.

One absolute rule: preserve original schemas in bronze throughout the migration. Never rewrite history to fit a new schema. Auditors and tax officers want the original data as it was recorded, not a retrospectively cleaned version. Any "data improvement" during migration belongs in silver, applied as a documented transformation — not applied retroactively to bronze.


Key Takeaways

  • Bronze is your immutable audit trail. Land raw GST, ERP, banking and IoT data untransformed; retain for eight years minimum under Companies Act 2013 / CGST Rules 2017; move to archive storage tiers after 18 months to control cost without destroying compliance posture.
  • Silver is your trusted entity layer. All finance and tax reconciliation queries should run against silver; treat silver quality as seriously as your chart of accounts — every entity needs a schema, an owner, and a freshness SLA.
  • Open table formats (Delta Lake / Apache Iceberg / Apache Hudi) are non-negotiable. They prevent vendor lock-in, enable ACID transactions and time travel, and allow finance, data science and GenAI workloads to share one storage layer without replication.
  • DPDP Act 2023 compliance is a hard architectural input, not a post-launch checklist. Tokenise PAN, Aadhaar and bank data at ingestion; implement column-level access controls; build data-principal rights (erasure/correction) workflows before the Data Protection Board begins enforcement in FY 2026-27.
  • The real cost of a data lake is compute, not storage. A mid-size Indian manufacturer can run a fully compliant, production-grade lakehouse for under Rs. 25,000/month; implement auto-termination, job tagging and weekly chargeback from day one.
  • Migrate staged, not big-bang. Run BI queries against old and new systems in parallel for a full quarter; decommission the legacy warehouse only after finance sign-off across two consecutive month-end closes; never rewrite history in bronze.
  • AI readiness is a data-quality problem before it is a model problem. A GenAI copilot grounded in an inconsistently-defined silver layer produces wrong answers with dangerous confidence; invest in the semantic layer and silver quality gates before selecting a model or a vector database.

Frequently Asked Questions

What is the difference between a data lake and a lakehouse?
A data lake stores raw files in object storage with limited transactional semantics. A lakehouse adds ACID transactions, schema enforcement and SQL performance on top of the lake using open table formats such as Delta, Iceberg or Hudi, combining the cost of a lake with the discipline of a warehouse.
Is a data lake DPDP-compliant by default?
No. DPDP compliance depends on how you classify data, encrypt PII, control access and retain records. A data lake must implement column-level encryption, masking, access logs, retention policies and data principal request workflows to be considered DPDP-aligned.
Which open table format should Indian enterprises pick?
All three of Delta, Iceberg and Hudi are production-ready in 2026. Delta works well with Databricks-led stacks, Iceberg has the strongest multi-engine ecosystem, and Hudi excels at streaming upserts. Choose based on your primary compute engine and team familiarity rather than benchmarks alone.
How long should financial data be retained in a lake?
At least eight financial years to satisfy the Companies Act 2013 and income-tax rules, and typically ten years for litigation comfort. DPDP requires personal data to be deleted after the consented retention period, so the lake must support tiered retention by data type.
Content reviewed by: Mayank Wadhera (CA | CS | CMA | Lawyer | Insolvency Professional | IBBI Valuator)
