Legal Suvidha is a registered trademark. Unauthorized use of our brand name or logo is strictly prohibited. All rights to this trademark are protected under Indian intellectual property laws.
Legal Suvidha
Goods & Service Tax (GST)

Designing a Data Warehouse for GST

Designing a data warehouse for GST in India begins with use cases like 2B reconciliation, ITC eligibility, refund ageing and audit readiness, not technology. The warehouse pulls data from the ERP, ASP-GSP, GSTN APIs, e-way bill system and MCA, then organises it into a star schema with an invoice line-item fact, GSTIN, HSN, calendar and document-type dimensions, Indian financial-year alignment, partitioning by month and DPDP-aligned access controls.

Mayank WadheraMayank Wadhera
Published: 26 Jun 2023
Updated: 23 May 2026
13 min read
Designing a Data Warehouse for GST
1
2
3
4
5
6
7
8
9
10
11
12

Design a GST data warehouse in 2026 that consolidates returns, e-invoices and ERP data into one defensible source of truth for Indian businesses.

Designing a Data Warehouse for GST

A GST data warehouse consolidates your GSTR-1, GSTR-2B, GSTR-3B, e-invoices, e-way bills and ERP ledgers into a single governed source of truth. For FY 2026-27, with GSTN's expanded API ecosystem and mandatory e-invoicing covering virtually every registered business above Rs. 5 crore turnover, building this warehouse is no longer optional engineering ambition β€” it is the operational infrastructure that makes month-end ITC reconciliation, vendor compliance scoring, refund tracking and audit defence feasible at scale without burning your accounts team every fortnight.


Define Use Cases Before You Write a Single Line of SQL

The most expensive GST data warehouse mistake is starting with technology. Teams spin up a Snowflake account, dump a year of GSTR-2B CSVs, and then discover six months later that nobody can answer the CFO's question: "How much ITC is sitting in the ledger but not yet reflected in 3B?"

Before touching infrastructure, map the questions the warehouse must answer. A practical use-case list for FY 2026-27 looks like this:

  • Month-end 2B reconciliation: Which supplier invoices are in my books but absent from GSTR-2B? What is the rupee value of at-risk ITC?
  • ITC eligibility tracking: Which credits are blocked under Section 17(5) of the CGST Act, 2017? What portion of input credit relates to exempt supplies and needs reversal under Rule 42?
  • Refund ageing: How many refund applications filed on the GST portal are beyond 60 days without an order? What is the total refund amount outstanding by application type?
  • Vendor compliance score: Which suppliers have a consistent GSTR-1 non-filing history, making my ITC unreliable?
  • Audit readiness: Can I produce a complete reconciliation between my audited financials and GST returns within 48 hours of receiving an SCN (Show Cause Notice)?
  • Inter-state stock transfer intelligence: What are the GST implications of branch-to-branch transfers for transfer-pricing documentation?

Each use case drives a distinct set of grain, freshness and access decisions. The reconciliation use case needs data fresh to the 14th of each month (when GSTR-2B is generated). The audit readiness use case needs six years of signed e-invoice JSON. Defining use cases upfront prevents you from over-engineering one dimension and under-building another.


Source Systems You Must Integrate

A GST warehouse without complete source coverage produces reconciliations you cannot trust. Map every system before your data engineers begin pipelines.

ERP (SAP, Oracle, Tally, Microsoft Dynamics or homegrown):

  • Sales register: invoice date, invoice number, buyer GSTIN, taxable value, IGST, CGST, SGST, HSN, document type
  • Purchase register: same fields on the vendor side, plus PO reference and three-way match status
  • Journal vouchers: for manual credit notes, debit notes and year-end provisions
  • Master data: vendor and customer GSTIN, PAN, address, registration category

ASP/GSP layer (Application Service Provider / GST Suvidha Provider):

  • GSTR-1 data as filed (outward supply details)
  • GSTR-2B (auto-drafted inward supply statement generated on the 14th of each following month)
  • GSTR-3B filed data and challan payment records
  • E-invoice IRN (Invoice Reference Number) status: generated, cancelled, amended

GSTN APIs (via GSP or direct NIC/IRP access):

  • Taxpayer search: GSTIN registration status, legal name, principal place of business, cancellation date
  • Filing status by GSTIN and period: has your supplier filed their GSTR-1 for the period in question?
  • Electronic Credit Ledger and Cash Ledger balances
  • Refund application status by ARN (Acknowledgement Reference Number)

E-Way Bill system:

  • EWB number, date, validity, document reference
  • Vehicle and transporter details
  • Origin and destination GSTIN, distance
  • Cancellation and extension events

MCA V3 and external data:

  • Director identification, CIN, date of incorporation for related-party and vendor due-diligence overlays
  • CIBIL/Dun & Bradstreet flags where your vendor risk policy demands them

Leaving out even one of these sources creates blind spots. A warehouse that has e-invoices but not EWB data cannot confirm whether goods actually moved to support the supply claim.


A star schema fits GST analytics better than a normalised relational model because finance users query it with aggregations (sum of ITC, count of non-compliant vendors) and BI tools generate efficient SQL against it without requiring complex joins.

Core Invoice Fact Table

Build the central fact at line-item grain β€” one row per invoice line, not per invoice. This is the only grain that supports HSN-level analysis, which GSTN's enhanced audit scrutiny in 2026 demands.

Key measures on the fact:

  • taxable_value_inr
  • igst_amount, cgst_amount, sgst_amount, cess_amount
  • itc_eligible_flag (pre-computed in ETL, not in BI)
  • itc_reversal_amount (Rule 42/43 apportionment)
  • matched_in_gstr2b_flag
  • irn_generated_flag, irn_cancelled_flag
  • ewb_linked_flag

Surrogate keys, not natural keys (invoice number, GSTIN), in every dimension join. Natural keys change β€” a supplier's trade name changes after a merger; an HSN code reclassifies mid-year.

Dimension Tables

DimensionKey AttributesSCD Type
dim_gstinLegal name, trade name, state, registration category, cancellation dateType 2
dim_calendarDate, GST tax period, Indian FY, quarter, month nameType 1
dim_hsnHSN/SAC code, description, GST rate, rate-change effective dateType 2
dim_document_typeInvoice, credit note, debit note, bill of supplyType 1
dim_transaction_typeB2B, B2C, exports, SEZ, deemed exports, RCMType 1

Use Type 2 Slowly Changing Dimensions (SCD-2) for dim_gstin and dim_hsn. A vendor that was GST-compliant in July 2025 but had their registration cancelled in November 2025 must retain both states so your historical ITC analysis remains accurate. SCD-2 achieves this by inserting a new row with valid_from and valid_to dates rather than overwriting.

Align dim_calendar explicitly to the Indian Financial Year (April–March) and the GST tax period (calendar month for most taxpayers, quarter for QRMP taxpayers). Analysts should never have to mentally convert.

Supporting Fact Tables

  • fact_refund_application: one row per ARN, with application date, amount claimed by head (IGST, CGST, SGST), sanction amount, order date and ageing in days
  • fact_irn_event: one row per IRN lifecycle event (generated, cancelled, amended) with timestamp; this is your e-invoice audit trail
  • fact_ewb_transaction: one row per e-way bill with movement and validity attributes
  • fact_gstr3b_liability: one row per GSTIN per period with declared liability, ITC claimed and net cash payment β€” this bridges your ERP liability with what you told GSTN

ETL Design and Data Freshness

ETL (Extract, Transform, Load) discipline separates a warehouse that analysts trust from one they work around.

Landing zone first. Ingest raw data in its source format before transforming anything. For e-invoices, this means persisting the signed JSON payload β€” IRN, QR code, supplier and buyer details β€” in your landing zone exactly as received from the IRP. The Income-tax Act, 1961 and the CGST Act, 2017 both recognise electronic records, and a signed JSON is your strongest audit evidence. Do not discard it after transformation.

Idempotent loads. GSTR-2B is regenerated when a supplier files a belated GSTR-1. Your ETL must detect that the 2B for a past period has changed and reprocess affected records without duplicating history. Use a source_hash or row_version column on every fact row.

Freshness SLAs by dataset:

DatasetRecommended FreshnessWhy
GSTR-2BLoaded by 15th of each monthReconciliation teams need it at month close
E-invoice IRNNear-real-time (4-hour lag)Finance approves shipments against IRN status
ERP invoice dataDaily overnightMatches operational rhythm
GSTN taxpayer masterWeeklyRegistration status does not change daily
Refund ARN statusDailyAgeing dashboards are time-sensitive

Publish these SLAs in your data catalogue (covered below) so finance users know when to expect updated numbers rather than repeatedly querying the data team.


Worked Example: GSTR-2B Reconciliation at Rs. Scale

Consider a mid-sized manufacturing company with Rs. 180 crore annual purchases in FY 2026-27. In October 2026, the finance team runs the month-end reconciliation.

ERP purchase register for October 2026:

  • 634 purchase invoices
  • Total ITC as per books: Rs. 1,42,80,000

GSTR-2B for October 2026 (generated 14 November 2026):

  • 611 invoices matched to supplier GSTINs
  • Total ITC per 2B: Rs. 1,36,50,000

Warehouse reconciliation output (automated):

CategoryInvoice CountITC at Risk (Rs.)
In ERP, not in 2B β€” supplier GSTR-1 not filed144,20,000
In ERP, not in 2B β€” GSTIN mismatch61,35,000
In ERP, not in 2B β€” HSN code discrepancy375,000
Total gap236,30,000

Without the warehouse, an analyst reconciles this manually in Excel across three CSVs over three working days. With the warehouse, the same output runs as a scheduled query and lands in the finance team's inbox by 6 a.m. on 15 November.

What to do with the Rs. 4,20,000 non-filing gap: Under Rule 36(4) of the CGST Rules, as amended, ITC claims must reconcile with GSTR-2B. If your supplier does not file by the time you finalise 3B, you must either defer the credit or reverse it and re-claim in the period when 2B reflects it. A warehouse that tracks this by vendor and aging bucket gives your team a prioritised call list for the AP department to chase suppliers before 3B filing.


Performance, Retention and Storage Tiering

Partition and cluster your invoice fact by GST tax period (month) and GSTIN. Most reconciliation queries filter by period first, then by GSTIN. Proper partitioning on modern column-store platforms (BigQuery, Snowflake, Fabric) can reduce query cost by 80–90% compared to a full-table scan.

72-month retention rule. Section 36 of the CGST Act, 2017 requires every registered person to maintain records for 72 months (six years) from the due date of the Annual Return (GSTR-9) for that year. For FY 2019-20, for example, that period runs until at least late 2026. Your warehouse must retain β€” and make queryable β€” six complete financial years of invoice, return and payment data at all times.

Design a two-tier storage architecture:

  • Hot tier: current FY plus two previous FYs β€” fast query, higher cost
  • Cold/archival tier: remaining retention periods β€” compressed, slower queries, lower cost

Most cloud platforms support this natively (BigQuery long-term storage pricing, Snowflake data retention + time-travel, AWS S3 lifecycle policies for Redshift Spectrum). Define lifecycle policies at build time, not after you receive the first storage cost shock.


Security and DPDP Alignment

The Digital Personal Data Protection Act, 2023 (DPDP Act) and its Rules (as notified) apply to warehouse copies of GST data because those copies contain personal data of individuals β€” proprietors, partners, directors, and sometimes named authorised signatories β€” embedded in GSTIN registration records, e-invoices and EWB data.

Implement these controls at build time:

  1. Role-based access at GSTIN scope. A regional finance manager should only query records belonging to GSTINs in their state. Implement row-level security in your warehouse platform using the state_code (digits 1–2 of the 15-character GSTIN) as the access predicate.
  1. Column-level masking for personal identifiers. Mask PAN, Aadhaar references, individual names and contact numbers in all views exposed to non-finance users (BI dashboards, data science sandboxes). Unmask only in privileged views with logged access.
  1. Query logging. Every query that touches a table tagged PERSONAL_DATA must generate an audit log entry: user, timestamp, predicate, rows returned. The DPDP Act's accountability principle requires you to demonstrate that personal data is accessed only for the declared purpose (GST compliance, not, say, credit profiling).
  1. Documented lawful basis. Maintain a data-processing register that records why each personal-data field exists in the warehouse, the legal basis (contractual necessity for supplier invoices, legal obligation for statutory compliance), and the retention period. This is not a warehouse-build task β€” but the warehouse architect must reserve columns and access policies for it from day one.
  1. Data principal rights. If a vendor (who is an individual proprietor) exercises a deletion or correction right under the DPDP Act, your warehouse must support a targeted purge without cascading breaks in aggregated return filings. Design soft-delete flags, not hard deletes, and ensure aggregated fact rows do not retain personally identifying natural keys.

Common Mistakes and Pitfalls to Avoid

1. Using invoice number as a dimension key. Invoice numbers are not unique across GSTINs, periods, or after amendments. Always use surrogate keys.

2. Storing only the latest 2B snapshot. GSTR-2B is retroactively amended when suppliers file belated GSTR-1 returns. If you overwrite 2B data on each pull, you lose the history of what was available at the time 3B was filed β€” which is exactly what an auditor will ask about.

3. Computing ITC eligibility in the BI layer. Section 17(5) blocked credit rules, Rule 42 reversals and Rule 86B cash payment restrictions involve conditional logic with many edge cases. Encoding this in a BI tool's calculated field means every analyst can implement it differently. Pre-compute a single, reviewed itc_eligible_flag in ETL, with a documented rulebook.

4. Ignoring the e-way bill source. A reconciliation that confirms an invoice exists in both the ERP and GSTR-2B but cannot confirm physical movement has a significant audit gap. Particularly for high-value goods, the EWB dimension is essential.

5. Building without a data catalogue. Within 18 months, a warehouse without documentation becomes a system only its original builder understands. Analysts stop trusting it; shadow Excel spreadsheets proliferate; the warehouse asset depreciates rapidly.

6. Treating QRMP and monthly filers identically. Under the QRMP (Quarterly Return Monthly Payment) scheme, small taxpayers file GSTR-1 quarterly but pay tax monthly via PMT-06 challan. Their GSTR-2B contribution appears only at quarter-end. Build separate processing logic for QRMP-supplier invoices to avoid false reconciliation alerts mid-quarter.


Platform Selection: A Practical Framework

Indian enterprises in 2026 evaluate Snowflake, Databricks, Microsoft Fabric, Google BigQuery and AWS Redshift. The right answer depends on four factors, not fashion:

  • Existing cloud commitment: If your ERP is Azure-hosted, Fabric has a friction advantage. If you are AWS-native, Redshift Spectrum + S3 may be simplest.
  • Data engineering skill: Databricks demands Spark expertise your team may not have. Snowflake is SQL-first and easier to staff.
  • Governance maturity: Fabric and Databricks Unity Catalog offer integrated governance that standalone Redshift does not match without add-ons.
  • Total cost at GST scale: Run a paid proof of concept on three months of e-invoices plus one full year of returns β€” roughly the data volume that exercises partitioning, concurrency and compression meaningfully. Benchmark cost per reconciliation query, not just storage price.

Do not dismiss mature on-premises options (PostgreSQL + columnar extensions, Microsoft SQL Server with columnstore indexes) for businesses where data sovereignty concerns or existing infrastructure costs make cloud migration uneconomic. A well-designed on-premises warehouse on proven hardware outperforms a poorly governed cloud instance every time.


Governance, Catalogue and Lineage

A governed warehouse earns trust; an ungoverned one generates shadow spreadsheets.

Data catalogue: Every table and view needs an owner, a sensitivity classification (CONFIDENTIAL β€” GST COMPLIANCE, INTERNAL, RESTRICTED β€” PERSONAL DATA), a retention label, a freshness SLA and a plain-language description. Automate catalogue updates from your ETL pipeline metadata rather than maintaining it manually.

Lineage: When the CFO asks why the ITC dashboard shows Rs. 2 lakh less than last week, lineage tracking lets you trace backward from the BI tile β†’ the SQL view β†’ the fact table β†’ the 2B ETL job β†’ the raw 2B file β†’ the ASP/GSP API response. Without lineage, you are guessing.

Product ownership: Assign one named business owner (typically the head of indirect tax or Group CFO's office) who signs off on the data model, approves new use cases and owns the quarterly stakeholder review. Without a named owner, the warehouse drifts and nobody feels accountable for its accuracy.

Prune aggressively. Unused tables and dashboards consume storage, confuse users and create compliance risk (data you forgot you had). Quarterly usage reviews β€” delete any table with zero queries in 90 days β€” keep the asset focused and trustworthy.


Key Takeaways

  • Use cases drive architecture. Define the questions β€” 2B reconciliation, ITC eligibility, refund ageing, vendor compliance β€” before selecting a platform.
  • Star schema at line-item grain is the right model for GST analytics. Build it with surrogate keys and Type-2 SCDs for GSTIN and HSN.
  • Pre-compute ITC eligibility in ETL, not in BI tools, and maintain a single documented rulebook for Section 17(5) and Rule 42 logic.
  • Retain 72 months of data as mandated by Section 36 of the CGST Act; design hot/cold storage tiers from day one to manage cost as the archive grows.
  • Persist signed e-invoice JSON in the landing zone β€” it is your primary audit evidence and cannot be reconstructed from transformed data.
  • DPDP Act compliance is not optional for warehouse copies of GST data. Implement row-level GSTIN-scoped access, column-level masking and query logging at build time.
  • A data catalogue with lineage is what converts a warehouse from a black box into a trusted, auditable asset. Without it, the warehouse depreciates faster than the hardware it runs on.

Frequently Asked Questions

Why build a dedicated data warehouse for GST?
GST analytics requires unified data across the ERP, e-invoice portal, e-way bill system and GSTN APIs. A dedicated warehouse provides a defensible single source of truth, enables fast reconciliations, supports audit responses, and turns compliance data into strategic management information for Indian businesses.
What schema works best for GST data?
A star schema with an invoice line-item fact and conformed dimensions for GSTIN, HSN, calendar, document type and transaction type performs well. Use Type 2 slowly changing dimensions for attributes like GSTIN status, and add separate fact tables for refunds, IRN events and e-way bills.
How long should the warehouse retain GST data?
Align retention to Section 35 of the CGST Act, which prescribes at least 72 months from the due date of the relevant annual return. Older periods can move to cold archival storage, with appropriate access logging and DPDP-aligned controls maintained throughout the retention window.
How do I secure the GST warehouse?
Apply role-based access at GSTIN scope, encrypt data at rest and in transit, mask personal identifiers in non-finance views, log queries touching personal data, and document the lawful basis for processing under the Digital Personal Data Protection Act, 2023 and its 2025 implementing rules.
Mayank Wadhera
Content Reviewed By

CA | CS | CMA | Lawyer | Insolvency Professional | IBBI Valuator

"I help founders increase real business value and achieve stronger valuations | Turning messy workflows into scalable, time-saving systems"

Share this article:

Related Posts

View All