Design a GST data warehouse in 2026 that consolidates returns, e-invoices and ERP data into one defensible source of truth for Indian businesses.
Designing a Data Warehouse for GST
A GST data warehouse consolidates your GSTR-1, GSTR-2B, GSTR-3B, e-invoices, e-way bills and ERP ledgers into a single governed source of truth. For FY 2026-27, with GSTN's expanded API ecosystem and mandatory e-invoicing covering virtually every registered business above Rs. 5 crore turnover, building this warehouse is no longer optional engineering ambition β it is the operational infrastructure that makes month-end ITC reconciliation, vendor compliance scoring, refund tracking and audit defence feasible at scale without burning your accounts team every fortnight.
Define Use Cases Before You Write a Single Line of SQL
The most expensive GST data warehouse mistake is starting with technology. Teams spin up a Snowflake account, dump a year of GSTR-2B CSVs, and then discover six months later that nobody can answer the CFO's question: "How much ITC is sitting in the ledger but not yet reflected in 3B?"
Before touching infrastructure, map the questions the warehouse must answer. A practical use-case list for FY 2026-27 looks like this:
- Month-end 2B reconciliation: Which supplier invoices are in my books but absent from GSTR-2B? What is the rupee value of at-risk ITC?
- ITC eligibility tracking: Which credits are blocked under Section 17(5) of the CGST Act, 2017? What portion of input credit relates to exempt supplies and needs reversal under Rule 42?
- Refund ageing: How many refund applications filed on the GST portal are beyond 60 days without an order? What is the total refund amount outstanding by application type?
- Vendor compliance score: Which suppliers have a consistent GSTR-1 non-filing history, making my ITC unreliable?
- Audit readiness: Can I produce a complete reconciliation between my audited financials and GST returns within 48 hours of receiving an SCN (Show Cause Notice)?
- Inter-state stock transfer intelligence: What are the GST implications of branch-to-branch transfers for transfer-pricing documentation?
Each use case drives a distinct set of grain, freshness and access decisions. The reconciliation use case needs data fresh to the 14th of each month (when GSTR-2B is generated). The audit readiness use case needs six years of signed e-invoice JSON. Defining use cases upfront prevents you from over-engineering one dimension and under-building another.
Source Systems You Must Integrate
A GST warehouse without complete source coverage produces reconciliations you cannot trust. Map every system before your data engineers begin pipelines.
ERP (SAP, Oracle, Tally, Microsoft Dynamics or homegrown):
- Sales register: invoice date, invoice number, buyer GSTIN, taxable value, IGST, CGST, SGST, HSN, document type
- Purchase register: same fields on the vendor side, plus PO reference and three-way match status
- Journal vouchers: for manual credit notes, debit notes and year-end provisions
- Master data: vendor and customer GSTIN, PAN, address, registration category
ASP/GSP layer (Application Service Provider / GST Suvidha Provider):
- GSTR-1 data as filed (outward supply details)
- GSTR-2B (auto-drafted inward supply statement generated on the 14th of each following month)
- GSTR-3B filed data and challan payment records
- E-invoice IRN (Invoice Reference Number) status: generated, cancelled, amended
GSTN APIs (via GSP or direct NIC/IRP access):
- Taxpayer search: GSTIN registration status, legal name, principal place of business, cancellation date
- Filing status by GSTIN and period: has your supplier filed their GSTR-1 for the period in question?
- Electronic Credit Ledger and Cash Ledger balances
- Refund application status by ARN (Acknowledgement Reference Number)
E-Way Bill system:
- EWB number, date, validity, document reference
- Vehicle and transporter details
- Origin and destination GSTIN, distance
- Cancellation and extension events
MCA V3 and external data:
- Director identification, CIN, date of incorporation for related-party and vendor due-diligence overlays
- CIBIL/Dun & Bradstreet flags where your vendor risk policy demands them
Leaving out even one of these sources creates blind spots. A warehouse that has e-invoices but not EWB data cannot confirm whether goods actually moved to support the supply claim.
The Recommended Data Model: Star Schema for GST Analytics
A star schema fits GST analytics better than a normalised relational model because finance users query it with aggregations (sum of ITC, count of non-compliant vendors) and BI tools generate efficient SQL against it without requiring complex joins.
Core Invoice Fact Table
Build the central fact at line-item grain β one row per invoice line, not per invoice. This is the only grain that supports HSN-level analysis, which GSTN's enhanced audit scrutiny in 2026 demands.
Key measures on the fact:
taxable_value_inrigst_amount,cgst_amount,sgst_amount,cess_amountitc_eligible_flag(pre-computed in ETL, not in BI)itc_reversal_amount(Rule 42/43 apportionment)matched_in_gstr2b_flagirn_generated_flag,irn_cancelled_flagewb_linked_flag
Surrogate keys, not natural keys (invoice number, GSTIN), in every dimension join. Natural keys change β a supplier's trade name changes after a merger; an HSN code reclassifies mid-year.
Dimension Tables
| Dimension | Key Attributes | SCD Type |
|---|---|---|
dim_gstin | Legal name, trade name, state, registration category, cancellation date | Type 2 |
dim_calendar | Date, GST tax period, Indian FY, quarter, month name | Type 1 |
dim_hsn | HSN/SAC code, description, GST rate, rate-change effective date | Type 2 |
dim_document_type | Invoice, credit note, debit note, bill of supply | Type 1 |
dim_transaction_type | B2B, B2C, exports, SEZ, deemed exports, RCM | Type 1 |
Use Type 2 Slowly Changing Dimensions (SCD-2) for dim_gstin and dim_hsn. A vendor that was GST-compliant in July 2025 but had their registration cancelled in November 2025 must retain both states so your historical ITC analysis remains accurate. SCD-2 achieves this by inserting a new row with valid_from and valid_to dates rather than overwriting.
Align dim_calendar explicitly to the Indian Financial Year (AprilβMarch) and the GST tax period (calendar month for most taxpayers, quarter for QRMP taxpayers). Analysts should never have to mentally convert.
Supporting Fact Tables
fact_refund_application: one row per ARN, with application date, amount claimed by head (IGST, CGST, SGST), sanction amount, order date and ageing in daysfact_irn_event: one row per IRN lifecycle event (generated, cancelled, amended) with timestamp; this is your e-invoice audit trailfact_ewb_transaction: one row per e-way bill with movement and validity attributesfact_gstr3b_liability: one row per GSTIN per period with declared liability, ITC claimed and net cash payment β this bridges your ERP liability with what you told GSTN
ETL Design and Data Freshness
ETL (Extract, Transform, Load) discipline separates a warehouse that analysts trust from one they work around.
Landing zone first. Ingest raw data in its source format before transforming anything. For e-invoices, this means persisting the signed JSON payload β IRN, QR code, supplier and buyer details β in your landing zone exactly as received from the IRP. The Income-tax Act, 1961 and the CGST Act, 2017 both recognise electronic records, and a signed JSON is your strongest audit evidence. Do not discard it after transformation.
Idempotent loads. GSTR-2B is regenerated when a supplier files a belated GSTR-1. Your ETL must detect that the 2B for a past period has changed and reprocess affected records without duplicating history. Use a source_hash or row_version column on every fact row.
Freshness SLAs by dataset:
| Dataset | Recommended Freshness | Why |
|---|---|---|
| GSTR-2B | Loaded by 15th of each month | Reconciliation teams need it at month close |
| E-invoice IRN | Near-real-time (4-hour lag) | Finance approves shipments against IRN status |
| ERP invoice data | Daily overnight | Matches operational rhythm |
| GSTN taxpayer master | Weekly | Registration status does not change daily |
| Refund ARN status | Daily | Ageing dashboards are time-sensitive |
Publish these SLAs in your data catalogue (covered below) so finance users know when to expect updated numbers rather than repeatedly querying the data team.
Worked Example: GSTR-2B Reconciliation at Rs. Scale
Consider a mid-sized manufacturing company with Rs. 180 crore annual purchases in FY 2026-27. In October 2026, the finance team runs the month-end reconciliation.
ERP purchase register for October 2026:
- 634 purchase invoices
- Total ITC as per books: Rs. 1,42,80,000
GSTR-2B for October 2026 (generated 14 November 2026):
- 611 invoices matched to supplier GSTINs
- Total ITC per 2B: Rs. 1,36,50,000
Warehouse reconciliation output (automated):
| Category | Invoice Count | ITC at Risk (Rs.) |
|---|---|---|
| In ERP, not in 2B β supplier GSTR-1 not filed | 14 | 4,20,000 |
| In ERP, not in 2B β GSTIN mismatch | 6 | 1,35,000 |
| In ERP, not in 2B β HSN code discrepancy | 3 | 75,000 |
| Total gap | 23 | 6,30,000 |
Without the warehouse, an analyst reconciles this manually in Excel across three CSVs over three working days. With the warehouse, the same output runs as a scheduled query and lands in the finance team's inbox by 6 a.m. on 15 November.
What to do with the Rs. 4,20,000 non-filing gap: Under Rule 36(4) of the CGST Rules, as amended, ITC claims must reconcile with GSTR-2B. If your supplier does not file by the time you finalise 3B, you must either defer the credit or reverse it and re-claim in the period when 2B reflects it. A warehouse that tracks this by vendor and aging bucket gives your team a prioritised call list for the AP department to chase suppliers before 3B filing.
Performance, Retention and Storage Tiering
Partition and cluster your invoice fact by GST tax period (month) and GSTIN. Most reconciliation queries filter by period first, then by GSTIN. Proper partitioning on modern column-store platforms (BigQuery, Snowflake, Fabric) can reduce query cost by 80β90% compared to a full-table scan.
72-month retention rule. Section 36 of the CGST Act, 2017 requires every registered person to maintain records for 72 months (six years) from the due date of the Annual Return (GSTR-9) for that year. For FY 2019-20, for example, that period runs until at least late 2026. Your warehouse must retain β and make queryable β six complete financial years of invoice, return and payment data at all times.
Design a two-tier storage architecture:
- Hot tier: current FY plus two previous FYs β fast query, higher cost
- Cold/archival tier: remaining retention periods β compressed, slower queries, lower cost
Most cloud platforms support this natively (BigQuery long-term storage pricing, Snowflake data retention + time-travel, AWS S3 lifecycle policies for Redshift Spectrum). Define lifecycle policies at build time, not after you receive the first storage cost shock.
Security and DPDP Alignment
The Digital Personal Data Protection Act, 2023 (DPDP Act) and its Rules (as notified) apply to warehouse copies of GST data because those copies contain personal data of individuals β proprietors, partners, directors, and sometimes named authorised signatories β embedded in GSTIN registration records, e-invoices and EWB data.
Implement these controls at build time:
- Role-based access at GSTIN scope. A regional finance manager should only query records belonging to GSTINs in their state. Implement row-level security in your warehouse platform using the
state_code(digits 1β2 of the 15-character GSTIN) as the access predicate.
- Column-level masking for personal identifiers. Mask PAN, Aadhaar references, individual names and contact numbers in all views exposed to non-finance users (BI dashboards, data science sandboxes). Unmask only in privileged views with logged access.
- Query logging. Every query that touches a table tagged
PERSONAL_DATAmust generate an audit log entry: user, timestamp, predicate, rows returned. The DPDP Act's accountability principle requires you to demonstrate that personal data is accessed only for the declared purpose (GST compliance, not, say, credit profiling).
- Documented lawful basis. Maintain a data-processing register that records why each personal-data field exists in the warehouse, the legal basis (contractual necessity for supplier invoices, legal obligation for statutory compliance), and the retention period. This is not a warehouse-build task β but the warehouse architect must reserve columns and access policies for it from day one.
- Data principal rights. If a vendor (who is an individual proprietor) exercises a deletion or correction right under the DPDP Act, your warehouse must support a targeted purge without cascading breaks in aggregated return filings. Design soft-delete flags, not hard deletes, and ensure aggregated fact rows do not retain personally identifying natural keys.
Common Mistakes and Pitfalls to Avoid
1. Using invoice number as a dimension key. Invoice numbers are not unique across GSTINs, periods, or after amendments. Always use surrogate keys.
2. Storing only the latest 2B snapshot. GSTR-2B is retroactively amended when suppliers file belated GSTR-1 returns. If you overwrite 2B data on each pull, you lose the history of what was available at the time 3B was filed β which is exactly what an auditor will ask about.
3. Computing ITC eligibility in the BI layer. Section 17(5) blocked credit rules, Rule 42 reversals and Rule 86B cash payment restrictions involve conditional logic with many edge cases. Encoding this in a BI tool's calculated field means every analyst can implement it differently. Pre-compute a single, reviewed itc_eligible_flag in ETL, with a documented rulebook.
4. Ignoring the e-way bill source. A reconciliation that confirms an invoice exists in both the ERP and GSTR-2B but cannot confirm physical movement has a significant audit gap. Particularly for high-value goods, the EWB dimension is essential.
5. Building without a data catalogue. Within 18 months, a warehouse without documentation becomes a system only its original builder understands. Analysts stop trusting it; shadow Excel spreadsheets proliferate; the warehouse asset depreciates rapidly.
6. Treating QRMP and monthly filers identically. Under the QRMP (Quarterly Return Monthly Payment) scheme, small taxpayers file GSTR-1 quarterly but pay tax monthly via PMT-06 challan. Their GSTR-2B contribution appears only at quarter-end. Build separate processing logic for QRMP-supplier invoices to avoid false reconciliation alerts mid-quarter.
Platform Selection: A Practical Framework
Indian enterprises in 2026 evaluate Snowflake, Databricks, Microsoft Fabric, Google BigQuery and AWS Redshift. The right answer depends on four factors, not fashion:
- Existing cloud commitment: If your ERP is Azure-hosted, Fabric has a friction advantage. If you are AWS-native, Redshift Spectrum + S3 may be simplest.
- Data engineering skill: Databricks demands Spark expertise your team may not have. Snowflake is SQL-first and easier to staff.
- Governance maturity: Fabric and Databricks Unity Catalog offer integrated governance that standalone Redshift does not match without add-ons.
- Total cost at GST scale: Run a paid proof of concept on three months of e-invoices plus one full year of returns β roughly the data volume that exercises partitioning, concurrency and compression meaningfully. Benchmark cost per reconciliation query, not just storage price.
Do not dismiss mature on-premises options (PostgreSQL + columnar extensions, Microsoft SQL Server with columnstore indexes) for businesses where data sovereignty concerns or existing infrastructure costs make cloud migration uneconomic. A well-designed on-premises warehouse on proven hardware outperforms a poorly governed cloud instance every time.
Governance, Catalogue and Lineage
A governed warehouse earns trust; an ungoverned one generates shadow spreadsheets.
Data catalogue: Every table and view needs an owner, a sensitivity classification (CONFIDENTIAL β GST COMPLIANCE, INTERNAL, RESTRICTED β PERSONAL DATA), a retention label, a freshness SLA and a plain-language description. Automate catalogue updates from your ETL pipeline metadata rather than maintaining it manually.
Lineage: When the CFO asks why the ITC dashboard shows Rs. 2 lakh less than last week, lineage tracking lets you trace backward from the BI tile β the SQL view β the fact table β the 2B ETL job β the raw 2B file β the ASP/GSP API response. Without lineage, you are guessing.
Product ownership: Assign one named business owner (typically the head of indirect tax or Group CFO's office) who signs off on the data model, approves new use cases and owns the quarterly stakeholder review. Without a named owner, the warehouse drifts and nobody feels accountable for its accuracy.
Prune aggressively. Unused tables and dashboards consume storage, confuse users and create compliance risk (data you forgot you had). Quarterly usage reviews β delete any table with zero queries in 90 days β keep the asset focused and trustworthy.
Key Takeaways
- Use cases drive architecture. Define the questions β 2B reconciliation, ITC eligibility, refund ageing, vendor compliance β before selecting a platform.
- Star schema at line-item grain is the right model for GST analytics. Build it with surrogate keys and Type-2 SCDs for GSTIN and HSN.
- Pre-compute ITC eligibility in ETL, not in BI tools, and maintain a single documented rulebook for Section 17(5) and Rule 42 logic.
- Retain 72 months of data as mandated by Section 36 of the CGST Act; design hot/cold storage tiers from day one to manage cost as the archive grows.
- Persist signed e-invoice JSON in the landing zone β it is your primary audit evidence and cannot be reconstructed from transformed data.
- DPDP Act compliance is not optional for warehouse copies of GST data. Implement row-level GSTIN-scoped access, column-level masking and query logging at build time.
- A data catalogue with lineage is what converts a warehouse from a black box into a trusted, auditable asset. Without it, the warehouse depreciates faster than the hardware it runs on.





