Data mining for GST

Data mining for GST in India is the systematic exploration of returns, e-invoices and e-way bill data using techniques like association rule mining, sequence mining, anomaly detection, network analysis and text mining. The CBIC uses these methods for risk scoring and fraud detection, while Indian businesses apply them to vendor risk assessment, liability forecasting, revenue leakage analysis and transfer-pricing intelligence, with DPDP-aligned pseudonymisation and access logging.

Mayank Wadhera
Published: 26 Jun 2023
Updated: 23 May 2026

How data mining techniques help Indian businesses and tax authorities extract value, detect anomalies and prevent fraud in GST data through 2026.

Data Mining for GST: A Practical Guide for Indian Businesses and Finance Teams (FY 2026-27)

Eight years and billions of invoice-level records later, GST has become one of the richest transactional datasets in any emerging economy. In FY 2026-27, the Central Board of Indirect Taxes and Customs (CBIC) operates a live analytics engine on GSTN data: flagging outliers, mapping supply networks and scoring taxpayer risk before many finance teams even open their return-preparation file. This guide explains the techniques being deployed, illustrates each with real-number worked examples, and shows you how to apply the same tools defensively and offensively inside your own organisation.


What "Data Mining" Actually Means in a GST Context

Data mining is not report-pulling or pivot-table work. It is the systematic application of statistical and computational methods to large datasets to extract patterns, associations, anomalies and predictive signals that are invisible in any single record.

Applied to GST, four analytics layers stack on top of each other:

  • Descriptive analytics: what happened (e.g., Rs. 3.2 crore ITC claimed in Q4 FY 2025-26 by a firm with a Rs. 1.8 crore annual turnover)
  • Diagnostic analytics: why it happened (ITC concentrated in three suppliers, all registered in the same month, same PIN code)
  • Predictive analytics: what is likely next (model assigns a 74% probability of a Section 61 scrutiny notice within 90 days if the pattern persists)
  • Prescriptive analytics: what you should do (reverse the doubtful Rs. 80 lakh before the 20th when your next GSTR-3B is due)

The CBIC's Directorate General of Analytics and Risk Management (DGARM) operates across all four layers. Most Indian businesses have barely reached the first.


The Five Core Techniques, and What They Actually Flag

1. Association Rule Mining

Association rule mining asks: which entities, HSN codes or transaction patterns appear together more often than chance would predict?

GSTN analysts use this to identify HSN-vendor combinations that are statistically improbable. Consider a distributor claiming ITC on HSN 8703 (passenger cars) while their declared principal supply is HSN 0901 (coffee). The mining algorithm computes the rule {ITC on HSN 8703} → {primary output HSN 0901}, with a lift of 340, meaning this co-occurrence is 340 times more common in confirmed-fraud cases than in the general GSTN population. Any lift above 50 in a trained GST fraud model is a serious flag.

For your own business, apply the same logic to your purchase register. Vendors whose invoiced product mix diverges sharply from their GSTIN-registered business category are worth manual scrutiny before you take ITC.
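
As a minimal sketch of how such a rule can be scored, the snippet below computes support, confidence and lift using the textbook definition (lift = P(C given A) ÷ P(C)). The tag names and the ten-taxpayer population are hypothetical, invented purely for illustration; they are not GSTN data and the function is not CBIC's model.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    records is a list of sets; each set holds the attribute tags of one taxpayer.
    """
    n = len(records)
    if n == 0:
        return 0.0, 0.0, 0.0
    n_a = sum(1 for r in records if antecedent in r)
    n_c = sum(1 for r in records if consequent in r)
    n_both = sum(1 for r in records if antecedent in r and consequent in r)

    support = n_both / n                           # P(A and C)
    confidence = n_both / n_a if n_a else 0.0      # P(C given A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C given A) / P(C)
    return support, confidence, lift


# Hypothetical toy population of ten taxpayers (tags are made up).
records = (
    [{"itc:8703", "out:0901"}]      # car ITC claimed by a coffee seller
    + [{"itc:8703", "out:8703"}]    # car ITC claimed by a car dealer
    + [{"itc:0901", "out:0901"}]    # coffee ITC claimed by a coffee seller
    + [{"itc:7208", "out:7210"} for _ in range(7)]  # unrelated steel traders
)

support, confidence, lift = rule_metrics(records, "itc:8703", "out:0901")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.10 confidence=0.50 lift=2.50
```

On your own purchase register, the same function can be pointed at tags such as a vendor's registered business category and the HSN codes it invoices; a lift far from 1 is a prompt for manual review, not a conclusion.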

2. Anomaly Detection on ITC Patterns

Anomaly detection compares each GSTIN's ITC profile against a dynamically constructed peer set: firms in the same industry, turnover band, state and filing history. GSTN has enough data to build granular peer groups with hundreds of members.

Worked number: Your peer group (Maharashtra manufacturer, NIC 2410: iron and steel, turnover Rs. 10–50 crore) has a median ITC-to-output-tax ratio of 0.82, with a standard deviation of 0.19. Your April 2026 ratio is 1.47. Z-score = (1.47 − 0.82) ÷ 0.19 = 3.42. That places you in roughly the top 0.03% of your peer distribution, assuming the ratios are approximately normally distributed. Even if the spike is legitimate (say, you purchased a new press line and capitalised a Rs. 40 lakh ITC claim on capital goods), you need documentation available before a notice arrives: GST-compliant tax invoices with valid IRN, GSTR-2B reflection, e-way bills and NEFT/RTGS payment proof.
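
The arithmetic in the worked number can be checked with a few lines of Python. This is a sketch that assumes the peer ratios are roughly normally distributed and uses the peer median as the centre, as the example does; the function names are illustrative.

```python
import math


def peer_z_score(value, peer_centre, peer_std):
    """Distance of one taxpayer's ratio from the peer centre, in standard deviations."""
    return (value - peer_centre) / peer_std


def upper_tail_share(z):
    """Share of a normal distribution lying above z (one-sided)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = peer_z_score(1.47, 0.82, 0.19)
tail = upper_tail_share(z)
print(f"z = {z:.2f}, share of peers above this ratio = {tail:.3%}")
```

Real ITC ratios are often skewed, so a percentile computed directly from the peer data (or a median-and-MAD based score) may be a safer benchmark than a normal-curve tail.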

3. Sequence Mining Over E-Invoice and E-Way Bill Events

GST generates time-stamped events at multiple points: e-invoice generation at an IRP (Invoice Registration Portal, accessible at einvoice1.gst.gov.in), e-way bill creation at ewaybill.nic.in, GSTR-1 reporting by the 11th, and GSTR-3B tax payment by the 20th. Sequence mining looks at the order and timing of these events to spot what cannot plausibly be legitimate.

Red flags the DGARM system has publicly been reported to examine:

  • E-way bill generated more than 24 hours after the invoice date for perishables
  • IRN generated but no corresponding e-way bill for consignments exceeding Rs. 50,000 in value where one is legally required
  • GSTR-1 reporting a supply but the same supplier's GSTR-3B showing zero tax payment: the classic "ITC-pass-through without remittance" pattern that Section 16(2)(c) is designed to prevent

For businesses, sequence mining on your own order-to-cash data surfaces revenue leakage: goods delivered in March 2026 but e-invoice generated on 2 April 2026, for example, shifts the supply to FY 2026-27 in the GSTR-1 record and mismatches the buyer's GSTR-2B for March, creating a cascading ITC timing problem you may not discover until a demand notice arrives.
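
A rule-based pass over your own event timestamps is a reasonable first step before any formal sequence-mining model. The sketch below encodes the timing checks described in this section; the record schema (invoice_at, ewb_at, delivered_on and so on) and the sample records are hypothetical, and the thresholds are simply the ones quoted in the text.

```python
from datetime import date, datetime, timedelta

EWB_VALUE_THRESHOLD_RS = 50_000              # consignment value cited in the text
PERISHABLE_EWB_WINDOW = timedelta(hours=24)  # window cited in the text


def sequence_flags(rec):
    """Return timing flags for one supply record (hypothetical schema)."""
    flags = []
    invoice_at, ewb_at = rec["invoice_at"], rec.get("ewb_at")

    # E-way bill raised more than 24 hours after the e-invoice, for perishables.
    if rec["perishable"] and ewb_at is not None and ewb_at - invoice_at > PERISHABLE_EWB_WINDOW:
        flags.append("late_ewb_perishable")

    # E-invoice exists but no e-way bill, although one is required above the threshold.
    if ewb_at is None and rec["ewb_required"] and rec["value_rs"] > EWB_VALUE_THRESHOLD_RS:
        flags.append("irn_without_ewb")

    # Goods delivered in one month, e-invoice generated in a later month.
    delivered = rec["delivered_on"]
    if (invoice_at.year, invoice_at.month) > (delivered.year, delivered.month):
        flags.append("invoice_period_shift")
    return flags


records = [
    {"id": "INV-001", "invoice_at": datetime(2026, 4, 2, 10, 0), "ewb_at": datetime(2026, 4, 2, 11, 0),
     "delivered_on": date(2026, 3, 30), "value_rs": 250_000, "perishable": False, "ewb_required": True},
    {"id": "INV-002", "invoice_at": datetime(2026, 3, 10, 9, 0), "ewb_at": datetime(2026, 3, 11, 15, 0),
     "delivered_on": date(2026, 3, 11), "value_rs": 80_000, "perishable": True, "ewb_required": True},
    {"id": "INV-003", "invoice_at": datetime(2026, 3, 12, 9, 0), "ewb_at": None,
     "delivered_on": date(2026, 3, 12), "value_rs": 120_000, "perishable": False, "ewb_required": True},
]

flagged = {r["id"]: sequence_flags(r) for r in records}
for invoice_id, flags in flagged.items():
    print(invoice_id, flags)
```

INV-001 reproduces the March-delivery, April-invoice case from the paragraph above and is flagged as a period shift.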

4. Network Analysis for Circular Trading Detection

Network analysis treats GSTINs as nodes and invoice flows as directed edges on a graph. When Company A supplies to B, B to C, and C back to A β€” all within a short window β€” the cycle-detection algorithm identifies it regardless of how many intermediate entities the promoters insert to obscure the loop.

Worked Rs. example: GSTINs A, B and C trade notionally in a triangle. Underlying goods value: Rs. 50 lakh. A raises an invoice on B for Rs. 57.5 lakh (15% markup), B on C for Rs. 66.1 lakh, C on A for Rs. 76 lakh. Each party claims ITC on the invoice it receives. Total fraudulent ITC generated across the three entities in one cycle at 18% GST: approximately Rs. 4.68 lakh on the Rs. 26 lakh value inflation. Promoters typically run dozens of cycles across hundreds of GSTINs before shutdown.
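
The triangle's arithmetic can be laid out explicitly. The sketch below takes the three invoice values exactly as stated (in Rs. lakh) and applies 18% to the value inflation between the underlying goods and the final invoice; on these inputs, 18% of Rs. 26 lakh works out to Rs. 4.68 lakh. Variable names are illustrative.

```python
GST_RATE = 0.18
UNDERLYING_VALUE = 50.0  # Rs. lakh, as stated in the example

# (supplier, buyer, invoice value in Rs. lakh), as stated in the example
invoices = [("A", "B", 57.5), ("B", "C", 66.1), ("C", "A", 76.0)]

# Inflation between the real goods value and the last invoice in the loop.
value_inflation = invoices[-1][2] - UNDERLYING_VALUE
itc_on_inflation = round(value_inflation * GST_RATE, 2)

# ITC each buyer claims on the invoice it receives.
itc_per_invoice = {f"{s}->{b}": round(v * GST_RATE, 2) for s, b, v in invoices}

print(f"value inflation: Rs. {value_inflation:.1f} lakh")
print(f"ITC attributable to inflation: Rs. {itc_on_inflation:.2f} lakh")
print(itc_per_invoice)
```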

The legal exposure is severe. Section 122(1)(vii) of the CGST Act, 2017 imposes a penalty equal to 100% of the tax amount on parties issuing or using fake invoices. Section 132 makes ITC fraud exceeding Rs. 5 crore a cognisable, non-bailable offence with imprisonment up to five years.

CBIC's network graphs also layer in non-financial attributes: GSTINs with an identical IP address for return filings, the same bank account linked for refund credits, and the same authorised signatory across multiple entities are clustering signals that push the combined fraud score over the prosecution threshold fast.
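
To illustrate how shared non-financial attributes pull entities into one cluster, here is a small union-find sketch. The field names (filing_ip, refund_account, signatory), the placeholder GSTINs and the sample values are all hypothetical; this shows the general grouping idea, not how CBIC's system is built.

```python
from collections import defaultdict


def cluster_by_shared_attributes(entities, keys=("filing_ip", "refund_account", "signatory")):
    """Group GSTINs that share any value of the listed attributes."""
    parent = {e["gstin"]: e["gstin"] for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (attribute, value) -> first GSTIN seen with that value
    for e in entities:
        for key in keys:
            value = e.get(key)
            if value is None:
                continue
            if (key, value) in first_seen:
                union(first_seen[(key, value)], e["gstin"])
            else:
                first_seen[(key, value)] = e["gstin"]

    groups = defaultdict(list)
    for gstin in parent:
        groups[find(gstin)].append(gstin)
    # Only clusters with more than one member are interesting.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


entities = [
    {"gstin": "GSTIN_A", "filing_ip": "192.0.2.1", "refund_account": "ACC-1", "signatory": "P1"},
    {"gstin": "GSTIN_B", "filing_ip": "192.0.2.1", "refund_account": "ACC-2", "signatory": "P2"},
    {"gstin": "GSTIN_C", "filing_ip": "192.0.2.9", "refund_account": "ACC-2", "signatory": "P3"},
    {"gstin": "GSTIN_D", "filing_ip": "192.0.2.7", "refund_account": "ACC-4", "signatory": "P4"},
]

clusters = cluster_by_shared_attributes(entities)
print(clusters)
# prints: [['GSTIN_A', 'GSTIN_B', 'GSTIN_C']]
```

A and B share a filing IP, B and C share a refund account, so all three land in one cluster even though A and C have nothing directly in common.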

5. Text Mining Over Narrations and Credit-Note Descriptions

GSTR-1 data, e-invoice JSON payloads from the IRP, and the remarks fields of credit notes contain free text. Natural language processing identifies keyword clusters ("quality rejection," "price revision," "goods returned in transit") and tests whether their frequency and distribution are consistent with the taxpayer's sector peer set.

A manufacturer issuing credit notes citing "quality rejection" for 38% of its total invoiced value when the peer median is 4.1% will be flagged for follow-up. From the business side, the same technique applied to your accounts-payable narrations and inbound credit notes identifies recurring dispute patterns that your AP team sees one invoice at a time but never as a statistical whole, such as a single vendor contributing 60% of all quality-rejection deductions.
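
Full NLP is not needed to get a first reading. The sketch below uses plain keyword matching on credit-note remarks to compute the share of invoiced value attributed to one reason and compares it with a peer median. The credit-note rows are invented, values are in Rs. lakh, and the alert multiple of 3 is an assumed internal threshold, not a figure from any authority.

```python
def reason_share(credit_notes, invoiced_value, keyword):
    """Share of invoiced value covered by credit notes whose remark mentions keyword."""
    matched = sum(cn["value"] for cn in credit_notes if keyword in cn["remark"].lower())
    return matched / invoiced_value


# Hypothetical credit notes (values in Rs. lakh).
credit_notes = [
    {"vendor": "V1", "value": 20.0, "remark": "Quality rejection - batch 12"},
    {"vendor": "V2", "value": 18.0, "remark": "QUALITY REJECTION after inspection"},
    {"vendor": "V1", "value": 5.0, "remark": "Price revision per contract"},
]
INVOICED_VALUE = 100.0  # total invoiced value, Rs. lakh (hypothetical)
PEER_MEDIAN = 0.041     # peer median share quoted in the text
ALERT_MULTIPLE = 3.0    # assumed internal threshold

share = reason_share(credit_notes, INVOICED_VALUE, "quality rejection")
flagged = share > ALERT_MULTIPLE * PEER_MEDIAN
print(f"quality-rejection share = {share:.0%}, flagged = {flagged}")
# prints: quality-rejection share = 38%, flagged = True
```

Grouping the matched notes by vendor gives the concentration view mentioned above, for example one vendor accounting for most of the quality-rejection deductions.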


How CBIC's DGARM Actually Triggers a Notice: The Workflow

DGARM's publicly described process runs roughly as follows:

  1. GSTR-1 data from suppliers flows into GSTN; GSTR-2B is auto-generated for buyers by the 14th of the following month.
  2. DGARM risk models update each GSTIN's composite score after every return cycle.
  3. GSTINs crossing defined score thresholds enter state or central GST authority queues for Section 61 scrutiny (risk-based assessment), Section 65 audit, or Section 67 inspection.
  4. Scrutiny notices under Section 61 now cite specifics: the exact quarter of the 2A-2B mismatch, the HSN code at issue, and in some cases the supplier GSTIN whose filing pattern triggered the flag.

Responding to such a notice without understanding the underlying analytics, and without the supporting documentation described above, is fighting a data-driven case with anecdote. The taxpayer who runs the same analytics on their own data in advance arrives at the notice-response stage with a pre-prepared reconciliation rather than a three-week scramble.


Worked Example: From Anomaly to Section 61 Notice to Resolution

Scenario: A Delhi-based trading firm, GSTIN 07XXXXX, turnover Rs. 22 crore in FY 2025-26, claims ITC of Rs. 3.8 crore in GSTR-3B for Q3 FY 2025-26 (October–December 2025). Their auto-generated GSTR-2B for the same period reflects only Rs. 2.9 crore. The Rs. 90 lakh gap arises because certain suppliers filed their GSTR-1 late; that ITC appears in GSTR-2B for January 2026.

What DGARM's engine sees: A 2A-2B mismatch of 31% (the Rs. 90 lakh gap measured against the Rs. 2.9 crore reflected in GSTR-2B) in a quarter where the firm's peer median mismatch is 4.2%. Risk score spikes by a modelled 2.1 standard deviations. A Section 61 scrutiny notice arrives in February 2026 demanding ITC reconciliation for Q3.

Documents required to defend the position:

  • Line-by-line ledger reconciliation mapping each invoice within the Rs. 90 lakh gap to the specific January or February 2026 GSTR-2B entry in which it eventually appeared
  • Copies of all underlying tax invoices with valid, IRP-verified IRNs
  • NEFT/RTGS payment confirmation for each supplier in the disputed set; rule-based enforcement increasingly treats unpaid-for ITC as a proxy for bogus supply
  • Confirmation that no supplier GSTIN in the Rs. 90 lakh set was cancelled, suspended or blacklisted at the time of the transaction

Cost of failure: If the Rs. 90 lakh is not reconciled to DGARM's satisfaction and Section 74 proceedings (fraud or suppression) are initiated, the exposure is: GST demand of Rs. 16.2 lakh (18% on Rs. 90 lakh) + penalty of Rs. 16.2 lakh (100%) + interest at 24% per annum from the date of wrongful availment. On a 12-month delay before resolution, interest alone adds approximately Rs. 3.9 lakh. Total worst-case exposure: Rs. 36.3 lakh on a reconciliation item that proper monthly analytics would have caught in October 2025.
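
The exposure figure can be reproduced with a small calculator. This sketch simply follows the arithmetic as the paragraph states it; the base, the rates and the 12-month period are inputs copied from the text rather than independently verified statutory values, and the function name is illustrative.

```python
def worst_case_exposure(base_lakh, tax_rate, penalty_rate, interest_rate, months):
    """Demand + penalty + simple interest, all in Rs. lakh."""
    demand = base_lakh * tax_rate
    penalty = demand * penalty_rate
    interest = demand * interest_rate * months / 12
    return {
        "demand": demand,
        "penalty": penalty,
        "interest": interest,
        "total": demand + penalty + interest,
    }


exposure = worst_case_exposure(
    base_lakh=90.0, tax_rate=0.18, penalty_rate=1.00, interest_rate=0.24, months=12
)
for item, amount in exposure.items():
    print(f"{item:>8}: Rs. {amount:.1f} lakh")
```

Keeping the rates as parameters makes it easy to rerun the estimate with whatever rate and base your tax adviser confirms for the specific proceeding.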


How Your Business Should Apply These Techniques β€” Practically, Today

Step 1: The Monthly GSTR-2B Reconciliation Before Every GSTR-3B Filing

Before filing each GSTR-3B (due the 20th of every month for monthly filers, or the 22nd/24th for QRMP scheme participants depending on state), complete three checks:

  1. Download GSTR-2B (available by the 14th from the GST portal) and reconcile line by line against your purchase register. Gaps > 5% of monthly ITC need a documented explanation.
  2. Verify the GSTIN status of every supplier who contributed more than Rs. 1 lakh to your ITC claim for the month. The GSTN portal supports batch GSTIN verification via its public API.
  3. Cross-check IRNs on invoices above Rs. 5 lakh using the IRP verification facility at einvoice1.gst.gov.in. A fraudulent invoice will not have a valid, verifiable IRN.
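
The first check lends itself to a short script. The sketch below matches purchase-register lines to GSTR-2B lines on (supplier GSTIN, invoice number) and applies the 5% gap rule from check 1. The record layout, the placeholder GSTINs and the sample amounts are hypothetical; a real GSTR-2B download would need to be parsed into this shape first.

```python
def reconcile(purchase_register, gstr2b, gap_threshold=0.05):
    """Compare book ITC with GSTR-2B ITC, invoice by invoice."""
    book = {(r["gstin"], r["invoice_no"]): r["itc"] for r in purchase_register}
    portal = {(r["gstin"], r["invoice_no"]): r["itc"] for r in gstr2b}

    missing_in_2b = sorted(k for k in book if k not in portal)
    value_mismatch = sorted(
        k for k in book if k in portal and abs(book[k] - portal[k]) > 0.01
    )

    book_total = sum(book.values())
    # ITC in the books that GSTR-2B does not support.
    unsupported = sum(book[k] for k in missing_in_2b)
    unsupported += sum(max(book[k] - portal[k], 0.0) for k in value_mismatch)
    gap = unsupported / book_total if book_total else 0.0

    return {
        "missing_in_2b": missing_in_2b,
        "value_mismatch": value_mismatch,
        "gap": gap,
        "needs_explanation": gap > gap_threshold,
    }


# Hypothetical sample data (ITC amounts in rupees, placeholder GSTINs).
purchase_register = [
    {"gstin": "27AAAAA0000A1Z5", "invoice_no": "INV-101", "itc": 100000.0},
    {"gstin": "27AAAAA0000A1Z5", "invoice_no": "INV-102", "itc": 50000.0},
    {"gstin": "07BBBBB1111B1Z3", "invoice_no": "B-77", "itc": 30000.0},
]
gstr2b = [
    {"gstin": "27AAAAA0000A1Z5", "invoice_no": "INV-101", "itc": 100000.0},
    {"gstin": "27AAAAA0000A1Z5", "invoice_no": "INV-102", "itc": 50000.0},
]

result = reconcile(purchase_register, gstr2b)
print(result["missing_in_2b"], f"gap = {result['gap']:.1%}", result["needs_explanation"])
```

In practice invoice numbers often differ in formatting between the two sources (leading zeros, slashes), so some normalisation of the matching key is usually needed before the comparison is meaningful.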

Step 2: Build Three Internal Ratios and Track Them Monthly

Monitor these ratios in a simple tracker. Excel is sufficient at the start:

Metric | Formula | Internal Alert Threshold
ITC-to-liability ratio | Total ITC claimed ÷ Output tax in GSTR-3B | > 20% above your own 12-month rolling average
Supplier concentration | ITC from top 3 suppliers ÷ Total ITC | > 60% for two consecutive months
2A-2B mismatch rate | (Book ITC − GSTR-2B ITC) ÷ Book ITC | > 8% in any single month

When any metric crosses its threshold, investigate before filing: not because CBIC will certainly notice, but because the underlying cause (a supplier filing late, a classification error, a credit note not booked) either costs you money or creates a future liability with compounding interest.
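
If the tracker outgrows a spreadsheet, the three ratios and their thresholds translate directly into code. This is a minimal sketch using the formulas and thresholds from this step; the function name, argument names and sample figures (in Rs. lakh) are hypothetical.

```python
def monthly_alerts(itc_claimed, output_tax, rolling_avg_ratio,
                   itc_by_supplier, prev_month_concentration, gstr2b_itc):
    """Return each metric as (value, alert) using the thresholds from this step."""
    itc_ratio = itc_claimed / output_tax

    top3 = sum(sorted(itc_by_supplier.values(), reverse=True)[:3])
    concentration = top3 / sum(itc_by_supplier.values())

    mismatch = (itc_claimed - gstr2b_itc) / itc_claimed

    return {
        # More than 20% above the 12-month rolling average.
        "itc_to_liability": (itc_ratio, itc_ratio > 1.20 * rolling_avg_ratio),
        # Above 60% this month and last month.
        "supplier_concentration": (
            concentration,
            concentration > 0.60 and prev_month_concentration > 0.60,
        ),
        # Above 8% in any single month.
        "mismatch_rate": (mismatch, mismatch > 0.08),
    }


alerts = monthly_alerts(
    itc_claimed=120.0,
    output_tax=100.0,
    rolling_avg_ratio=0.90,
    itc_by_supplier={"S1": 50.0, "S2": 30.0, "S3": 20.0, "S4": 20.0},
    prev_month_concentration=0.65,
    gstr2b_itc=114.0,
)
for metric, (value, alert) in alerts.items():
    print(f"{metric}: {value:.2f} alert={alert}")
```

In this sample month the first two metrics raise alerts and the mismatch rate (5%) stays under its 8% threshold.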

Step 3: Map Your Own GSTIN Network for Group Entities

If your group has more than five GSTINs (branches, subsidiaries, special purpose entities), draw a directed graph of inter-entity invoice flows. This takes an afternoon in any graph visualisation tool, including free ones like Gephi. Look for cycles (A supplies B, B supplies C, C supplies A), even innocent ones that arise from inter-branch transfers. An innocent cycle will still flag in DGARM's algorithm; having the business rationale documented beforehand saves weeks during any inquiry.
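
For a group with a handful of entities, cycles can also be listed programmatically. The sketch below is a plain depth-first search over a directed edge list, written without third-party libraries; the entity labels and edges are hypothetical, and it is intended for small graphs only (graph libraries such as networkx provide simple_cycles for larger ones).

```python
def find_cycles(edges):
    """List simple cycles in a directed graph given as (supplier, buyer) pairs.

    Each cycle is reported once, starting and ending at its smallest node.
    """
    graph = {}
    for supplier, buyer in edges:
        graph.setdefault(supplier, set()).add(buyer)
        graph.setdefault(buyer, set())

    cycles = []

    def dfs(start, node, path, on_path):
        for nxt in sorted(graph[node]):
            if nxt == start:
                cycles.append(path + [start])
            elif nxt not in on_path and nxt > start:
                # Only walk through nodes larger than start, so a cycle is
                # never rediscovered from one of its other members.
                dfs(start, nxt, path + [nxt], on_path | {nxt})

    for start in sorted(graph):
        dfs(start, start, [start], {start})
    return cycles


# Hypothetical inter-entity invoice flows within a group.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
cycles = find_cycles(edges)
print(cycles)
# prints: [['A', 'B', 'C', 'A']]
```

Each cycle found is a candidate for a short written note explaining the commercial reason for the flow.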


Common Mistakes That Convert an Analytics Output into a Real Notice

Mistake 1: Treating GSTR-2B as infallible. GSTR-2B is auto-generated from supplier GSTR-1 data. If a supplier entered your GSTIN incorrectly or reported the wrong invoice value, your 2B is wrong. Always reconcile against the actual invoice rather than simply accepting the system figure.

Mistake 2: Claiming ITC before GSTR-2B reflection "because the invoice is genuine." Section 16(2)(aa) of the CGST Act, inserted by the Finance Act 2021 and operative from 1 January 2022, makes GSTR-2B reflection a statutory condition for ITC availability, not a best practice. Claiming ITC before it appears in GSTR-2B exposes you to a demand under Section 73 or 74 if the supplier never files.

Mistake 3: Ignoring supplier suspension notices. When a GSTIN is suspended, the supplier's GSTR-1 is blocked. ITC you have taken on their invoices during the suspension period will not appear in any future GSTR-2B update. The liability sits with you under Section 16(2)(c). Run a bulk status check β€” at minimum quarterly, ideally monthly.

Mistake 4: Running analytics only at year-end. Year-end analysis finds problems after they have crystallised into filed GSTR-3B liabilities, interest exposure, and an audit paper trail. Monthly analysis catches them as reversible journal entries. Build the reconciliation into your return-preparation SOP; it belongs in the same checklist as attaching the tax payment challan.

Mistake 5: Exporting GSTN data to third-party tools without a DPDP governance structure. GSTN data for a proprietorship or partnership links directly to an identifiable individual. This is personal data under the Digital Personal Data Protection Act, 2023 (DPDP Act). Exporting it to an analytics vendor without documented lawful basis, data processing agreements and minimisation protocols is a compliance risk distinct from, and additional to, your GST exposure.


DPDP Act, 2023: The Governance Overlay You Cannot Ignore

GST analytics workflows almost always touch personal data: a proprietor's GSTIN embeds the individual's PAN, is typically Aadhaar-authenticated at registration, and is individually identifiable. The DPDP Act, 2023 applies to any digital processing of this data, including internal analytics.

Practical compliance steps:

  1. Pseudonymise GSTINs in analytics working tables where individual-level identification is not required β€” replace with an internal entity code.
  2. Log all access to raw GSTN extracts. Retention must cover the longer of (a) 72 months from the due date of the relevant annual return (the period for which Section 36 of the CGST Act requires books and records to be kept, which outlasts the time limits for orders under Sections 73 and 74) or (b) any purpose-specific horizon.
  3. Document the lawful basis for each data processing activity. Internal tax risk management constitutes a legitimate interest, but the documentation must exist and be reviewable by an auditor.
  4. Require human review before automated outputs drive decisions. No algorithm should autonomously trigger a supplier de-listing, an ITC reversal instruction or a vendor payment block. A qualified reviewer must sign off. This satisfies both the DPDP Act's accountability principle and the "explainability" standard that ICAI expects in AI-assisted audit work.
  5. Data minimisation. Analytics workflows should extract only the fields necessary for the specific analysis. Do not pull full GSTN master data when only the turnover-band and monthly ITC totals are required.
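
Steps 1, 2 and 5 can be sketched with the Python standard library. The example below replaces each GSTIN with a keyed hash (HMAC-SHA256), keeps only the fields the analysis needs, and records an access-log entry. The key, the placeholder GSTINs, the field names and the log structure are all hypothetical; key storage, log retention and tamper-resistance are left to your own infrastructure.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def pseudonymise_gstin(gstin, secret_key):
    """Replace a GSTIN with a stable internal entity code (keyed hash)."""
    digest = hmac.new(secret_key, gstin.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ENT-" + digest[:16]


def log_access(log, user, extract_name, purpose):
    """Append one access record; a real log would go to durable storage."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "extract": extract_name,
        "purpose": purpose,
    }
    log.append(entry)
    return entry


# Hypothetical key; in practice load it from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-key-from-your-secrets-manager"

raw_rows = [
    {"gstin": "27AAAAA0000A1Z5", "legal_name": "Example Traders", "monthly_itc": 4.2},
    {"gstin": "07BBBBB1111B1Z3", "legal_name": "Sample Metals", "monthly_itc": 1.1},
]

# Working table: pseudonymised key plus only the field the analysis needs.
working_table = [
    {"entity_code": pseudonymise_gstin(r["gstin"], SECRET_KEY), "monthly_itc": r["monthly_itc"]}
    for r in raw_rows
]

access_log = []
log_access(access_log, "analyst01", "gstr2b_2026_04", "monthly ITC reconciliation")
print(working_table)
```

Because the hash is keyed, the same GSTIN always maps to the same entity code for joins across tables, while re-identification requires access to the key.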

Building the Capability: What a Rs. 25–500 Crore Business Actually Needs

You do not need a data science department. A functional GST analytics capability for a mid-sized Indian company requires:

  • A data engineer or analyst with SQL skills to pull, clean and join GSTR-2B downloads, the purchase register, GSTR-1 outward supply data, and e-way bill records into a single working table updated monthly.
  • A GST domain expert (your in-house CA or tax consultant) to define which patterns matter and translate findings into legal exposure estimates.
  • Standard, low-cost tools: Excel for ratio monitoring; Python (pandas, networkx) for network analysis; Power BI or Google Looker Studio for dashboards. The GSTN portal provides APIs for GSTR-2B download, GSTIN status verification and IRN validation that can be integrated without enterprise software.
  • A monthly rhythm: a one-page output (three ratios, flagged suppliers, resolved or pending status) shared with the CFO and tax consultant before each GSTR-3B due date.

Generative AI tools are beginning to assist by summarising notices in plain language and drafting reconciliation narratives. Use them for productivity, not for the analytical judgement. The underlying data work (feature selection, validation, document retrieval) must remain with a qualified human.


Key Takeaways

  • CBIC's DGARM operates continuous, peer-benchmarked analytics on GSTN data. Your next GSTR-3B is evaluated against your peer group the moment it is submitted. Understanding the techniques means you can anticipate the flags before they become notices.
  • Section 16(2)(aa) of the CGST Act makes GSTR-2B reflection a hard statutory condition for ITC, not an administrative preference. Any gap between your books and GSTR-2B is a live Section 73 or 74 exposure.
  • Three internal ratios (ITC-to-liability, supplier concentration and 2A-2B mismatch rate), tracked monthly before each GSTR-3B filing, give you an early-warning system that mirrors the logic DGARM uses.
  • Circular trading penalties are severe: 100% of the tax amount under Section 122, and criminal prosecution for ITC fraud exceeding Rs. 5 crore under Section 132. Network analysis makes these patterns straightforward to detect, for CBIC and for you.
  • The DPDP Act, 2023 applies to your GST analytics workflows. Pseudonymise individual identifiers, document lawful basis, retain access logs for at least 72 months, and require human sign-off before any automated output drives a financial or commercial decision.
  • The cost of reactive analytics is quantifiable: on a Rs. 90 lakh ITC dispute, worst-case exposure including penalty and 12 months' interest at 24% per annum reaches Rs. 36 lakh, a multiple of the cost of a well-run monthly reconciliation process.
  • Start small and consistent. A monthly one-page dashboard, updated before each return due date, is more valuable than a sophisticated annual analytics project that arrives after audit season has begun.

Frequently Asked Questions

What is data mining in the GST context?
It is the systematic exploration of GST returns, e-invoices and e-way bill data to identify patterns, associations, anomalies and predictive signals. Techniques include association rule mining, anomaly detection, network analysis, sequence mining and text mining on narrative fields.
Does the CBIC use data mining?
Yes. The CBIC's analytics directorate has publicly described risk scoring, network analysis and anomaly detection over GSTN data. Notices frequently cite specific patterns like 2A-2B mismatch spikes or e-way bill mileage anomalies that emerge from data mining workflows.
Where can Indian businesses apply GST data mining first?
Start with vendor risk scoring from the purchase register, anomaly detection on monthly liability before filing, and a three-way reconciliation across ERP, IRN and GSTR-1. These deliver quick wins in cash flow, audit readiness and dispute defence with manageable effort and cost.
Does the DPDP Act constrain data mining?
Yes. Where GST data carries personal identifiers, processing must serve a documented lawful purpose, be limited to what is necessary, be secured through reasonable safeguards, and respect data-principal rights, subject to statutory retention. Outputs driving decisions must be explainable and human-reviewed.
Content Reviewed By: Mayank Wadhera
CA | CS | CMA | Lawyer | Insolvency Professional | IBBI Valuator

"I help founders increase real business value and achieve stronger valuations | Turning messy workflows into scalable, time-saving systems"
