Cluster analysis groups GST taxpayers and invoices by behaviour, helping Indian businesses and tax authorities spot anomalies, fraud and risk patterns.
Cluster Analysis for GST Data
Cluster analysis is an unsupervised machine-learning technique that groups GST taxpayers, invoices, or transactions by behavioural similarity, without needing pre-labelled fraud examples to learn from. Applied to GSTN's combined GSTR-1, GSTR-3B, GSTR-2B, e-invoice and e-way bill data, it surfaces fake-ITC networks, circular-trading rings, and refund anomalies that deterministic rule engines consistently miss. As of FY 2026-27, the CBIC (Central Board of Indirect Taxes and Customs) openly uses K-Means and DBSCAN in its risk-profiling stack; Indian businesses now need the same analytical lens internally to stay ahead of assessments and protect working capital.
What Cluster Analysis Means When Applied to GST Data
Clustering partitions records into groups where members within a group resemble each other more than they resemble members of any other group. Applied to GST, a record can be:
- A GSTIN summarised over a month or quarter (the standard grain for taxpayer risk profiling)
- A single invoice or HSN line item (for transaction-level anomaly detection)
- An e-way bill (for goods-movement consistency checks)
The fundamental difference from rule-based compliance analytics is this: clustering discovers patterns it has never been told to look for. A rule engine flags an ITC-to-output-tax ratio above a fixed threshold. A clustering model discovers that a pocket of 400 GSTINs across a sector clusters together because they share a combination of behaviours (elevated ITC concentration, low e-way bill distances, above-average IRN cancellation rates), none of which individually crosses any threshold. That combination, invisible to rules, is exactly what an unsupervised model surfaces.
The GSTN's analytics infrastructure now holds nearly a decade of return filings (GST took effect in July 2017), IRP (Invoice Registration Portal) e-invoice metadata, and e-way bill records in a unified data warehouse. That depth of longitudinal data makes the feature space for clustering exceptionally rich: seasonal patterns, growth trajectories, and counterparty network changes all become usable signals.
Unlike classification (which requires a labelled training set of "known fraud" cases), clustering works with entirely unlabelled data. This matters in a GST context because novel evasion schemes, such as new structures for circular trading or new methods for splitting invoices below e-invoice thresholds, have no historical label. A classifier trained on yesterday's fraud looks for yesterday's fraud. A clustering model trained on behaviour notices any unusual behaviour, regardless of whether anyone has seen it before.
The Four Algorithms in the GST Analytics Stack
K-Means
K-Means partitions n GSTINs into k clusters by minimising within-cluster variance. It is fast, interpretable, and scales to millions of records. The constraint is that you must specify k in advance and every GSTIN is assigned to exactly one cluster; there is no concept of an outlier.
Best used for: Segmenting an entire state's taxpayer base into four to eight risk bands before the start of an audit cycle, or classifying exporters versus traders versus manufacturers by behavioural profile.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN groups records that are densely packed in feature space and marks sparse, isolated points as noise, the algorithm's term for outliers. In GST fraud detection, those noise points are typically the most interesting: GSTINs or invoice clusters that belong to no normal group. Circular-trading rings generate exactly the kind of tight, isolated density cluster that DBSCAN is designed to detect.
Best used for: Detecting circular-trading rings, fake-supplier networks, and any setting where you need the algorithm to identify its own outliers rather than forcing everything into a predefined segment.
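A minimal sketch of how this behaves, assuming scikit-learn and entirely synthetic data: 200 ordinary GSTINs, a tight group of eight with ring-like behaviour, and one lone extreme record. The two feature columns and the `eps` and `min_samples` values are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features per GSTIN: [ITC/output ratio, IRN cancellation rate]
normal = rng.normal(loc=[0.7, 0.03], scale=[0.05, 0.005], size=(200, 2))
ring = rng.normal(loc=[4.5, 0.30], scale=[0.05, 0.005], size=(8, 2))
stray = np.array([[9.0, 0.90]])  # one isolated, extreme GSTIN

# Standardise so both features contribute comparably to distance
X = StandardScaler().fit_transform(np.vstack([normal, ring, stray]))

# eps and min_samples are illustrative values for this toy data only
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

ring_labels = set(labels[200:208])   # the eight ring members
noise_idx = np.where(labels == -1)[0]  # DBSCAN marks noise with -1
print("ring cluster label(s):", ring_labels)
print("noise row indices:", noise_idx.tolist())
```

The eight ring members form their own small cluster, while the lone record receives the noise label `-1`. Both outcomes are worth an investigator's attention, for different reasons.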
Hierarchical Clustering
Hierarchical methods produce a dendrogram, a tree diagram showing how clusters merge at increasing levels of similarity. The output is explainable at every level of granularity and far easier to present to a senior officer or appellate authority than a K-Means centroid vector. The trade-off is computational: hierarchical clustering does not scale to millions of GSTINs without prior sampling or aggregation.
Best used for: Sector-specific deep dives, export-refund anomaly analysis, or any setting where a defensible, step-by-step explanation of cluster formation is required for documentation purposes.
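As a rough illustration of the merge tree, the sketch below uses SciPy's hierarchical clustering on six hypothetical exporters with two already-standardised features. The names, the values and the choice of Ward linkage are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical exporters with two standardised features each
names = ["EXP-1", "EXP-2", "EXP-3", "EXP-4", "EXP-5", "EXP-6"]
X = np.array([
    [0.00, 0.10], [0.10, 0.00], [0.05, 0.05],  # similar refund behaviour
    [3.00, 3.10], [3.10, 3.00],                # a second, distinct group
    [8.00, 8.00],                              # an isolated exporter
])

# Each row of Z records one merge: the two clusters joined, the merge
# distance, and the size of the resulting cluster
Z = linkage(X, method="ward")

# Cut the tree into at most three flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the leaf ordering without needing matplotlib
tree = dendrogram(Z, no_plot=True, labels=names)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Because `Z` lists every merge with its distance, the same tree can be cut at a coarser or finer level without refitting, which is what makes the step-by-step explanation possible.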
Gaussian Mixture Models (GMM)
GMMs assign each GSTIN a probability of belonging to each cluster rather than a hard label. A borderline taxpayer might score 62% Cluster 2 (medium-risk) and 38% Cluster 4 (high-risk). This probabilistic output is more actionable than a binary flag because it allows investigators to rank-order uncertain cases rather than treating every medium-risk taxpayer identically.
Best used for: Prioritising investigation capacity when the number of flagged GSTINs exceeds available officer hours. Soft scores let you work from the highest-probability high-risk end of the distribution.
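A small sketch of the soft-scoring idea, assuming scikit-learn's `GaussianMixture` and a single synthetic feature (the ITC/output ratio). The two simulated populations and the three borderline taxpayers are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic ITC/output ratios for two behaviour groups
medium = rng.normal(1.0, 0.3, size=(300, 1))
high = rng.normal(3.0, 0.3, size=(60, 1))
X = np.vstack([medium, high])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Component order is arbitrary, so find the high-ratio component by its mean
high_comp = int(np.argmax(gmm.means_.ravel()))

# Three hypothetical taxpayers to score
candidates = np.array([[2.0], [1.0], [3.2]])
proba = gmm.predict_proba(candidates)  # one row per taxpayer, rows sum to 1
p_high = proba[:, high_comp]

# Work the queue from the highest probability of high-risk membership
queue = np.argsort(-p_high)
for i in queue:
    print(f"ratio {candidates[i, 0]:.1f}: P(high-risk component) = {p_high[i]:.3f}")
```

Sorting by `p_high` gives the rank-ordered investigation queue described above, instead of a flat list of flagged GSTINs.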
Feature Engineering: The Variables That Actually Discriminate
Raw GST data (turnover amounts, ITC rupee values, tax payable) clusters poorly by itself because a large legitimate manufacturer and a large fake-ITC operator can look identical on absolute numbers. Discriminating power comes from engineered features: ratios, rates of change, network metrics, and behavioural markers. A practical starting set of approximately 20 features across five categories:
Return and liability behaviour
- ITC claimed ÷ output tax payable (the single most powerful feature for fake-ITC detection)
- ITC utilised ÷ ITC available in the electronic credit ledger
- Month-on-month percentage change in ITC claimed (sudden spikes are high-signal events)
- Standard deviation of monthly GSTR-3B filing date (chronic late filers cluster differently from habitual early filers)
- Ratio of credit notes to original invoices declared in GSTR-1
Supply chain and counterparty network
- Number of distinct supplier GSTINs contributing to total inbound ITC
- Proportion of inbound supply value concentrated in the top three vendors
- Proportion of outward supplies to related parties (relevant under Section 15 of the CGST Act, 2017)
- IRN cancellation rate on self-generated e-invoices at the IRP
- Proportion of claimed ITC with no corresponding entry in GSTR-2B from active counterparties (the Section 16(2)(aa) compliance ratio)
E-way bill and logistics consistency
- Declared e-way bill distance versus calculated road distance for the same origin-destination pair
- Proportion of e-way bills with distance below 10 km despite an interstate supply code
- E-way bill extension frequency as a proportion of total bills generated
Temporal and seasonal patterns
- Three-month, six-month and twelve-month rolling averages of the ITC/output ratio
- Quarterly turnover standard deviation (captures seasonality-inconsistent behaviour)
- Gap in months between GSTIN registration date and first GSTR-1 return filing
Cross-return reconciliation gaps
- GSTR-1 declared turnover versus GSTR-3B declared turnover (net mismatch as percentage of turnover)
- GSTR-2B-to-GSTR-3B pass-through ratio: what proportion of claimed ITC actually appears in a live counterparty's GSTR-1
Indian tax analytics practitioners consistently find that approximately 20 well-engineered features outperform models built on hundreds of raw columns, because each feature encodes business logic (an understanding of why a ratio matters) that no algorithm can independently derive from raw rupee amounts.
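A few of the ratio and rate-of-change features listed above can be sketched with pandas. The column names, the two GSTINs and every figure in this table are hypothetical; a real pipeline would read them from the unified return data.

```python
import numpy as np
import pandas as pd

# Illustrative GSTIN-month rows; amounts in Rs. lakh, column names hypothetical
months = ["2026-01", "2026-02", "2026-03", "2026-04"]
monthly = pd.DataFrame({
    "gstin": ["A"] * 4 + ["B"] * 4,
    "month": months * 2,
    "itc_claimed": [60.0, 64.0, 68.0, 80.0, 7.0, 7.2, 6.8, 7.0],
    "output_tax": [14.0, 14.0, 14.0, 14.0, 10.0, 10.0, 10.0, 10.0],
    "irn_generated": [100, 100, 100, 100, 200, 200, 200, 200],
    "irn_cancelled": [30, 34, 36, 36, 6, 6, 6, 6],
}).sort_values(["gstin", "month"]).reset_index(drop=True)

# Ratios: a zero denominator becomes NaN instead of raising or giving inf
monthly["itc_to_output"] = (
    monthly["itc_claimed"] / monthly["output_tax"].replace(0, np.nan)
)
monthly["irn_cancel_rate"] = (
    monthly["irn_cancelled"] / monthly["irn_generated"].replace(0, np.nan)
)

# Month-on-month change in ITC claimed, computed within each GSTIN
prev_itc = monthly.groupby("gstin")["itc_claimed"].shift(1)
monthly["itc_mom_change"] = (monthly["itc_claimed"] - prev_itc) / prev_itc

# Three-month rolling average of the ITC/output ratio, per GSTIN
monthly["itc_to_output_3m"] = (
    monthly.groupby("gstin")["itc_to_output"]
    .transform(lambda s: s.rolling(window=3).mean())
)

print(monthly[["gstin", "month", "itc_to_output", "itc_mom_change",
               "itc_to_output_3m"]].round(3))
```

The first month of each GSTIN has no prior value, so its month-on-month change is NaN; the rolling average stays NaN until three months are available. Both gaps need an explicit decision (drop or impute) before clustering.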
Step-by-Step: Building a GST Clustering Pipeline
The following sequence applies whether you are a tax department building a state-level taxpayer risk model or a corporate CFO analysing 500 active vendors:
- Pull and unify data. Extract GSTR-1 summary, GSTR-3B liability and payment data, GSTR-2B ITC statements, IRP e-invoice metadata, and e-way bill records into a single data store: a local DuckDB for small exercises, BigQuery or Snowflake for state-scale datasets.
- Aggregate to GSTIN-month grain. Produce exactly one row per GSTIN per month. Never cluster raw invoice rows. Doing so clusters individual transactions, not taxpayer behaviour patterns, and a single large taxpayer will dominate distance calculations.
- Encode nil returns explicitly. A nil return is not the same as a missing return. Create a binary `nil_filer` flag and a count of consecutive nil months. Nil filers form their own cluster and may represent dormant-registration fraud.
- Standardise all features. Apply z-score normalisation (`StandardScaler` in scikit-learn). Turnover runs in crores; the ITC/output ratio runs between 0 and 10; the IRN cancellation rate runs between 0 and 1. Without standardisation, the high-magnitude features dominate Euclidean distance and suppress the behavioural signals.
- Reduce dimensions for visualisation. Apply PCA (Principal Component Analysis) to retain 85 to 95% of variance, then UMAP for 2-D plotting. A domain expert who cannot read a 20-dimensional centroid vector can usually interpret a well-labelled 2-D scatter plot.
- Tune the algorithm rigorously. For K-Means: compute silhouette scores across k = 3 through k = 15 and use the elbow plot plus the silhouette peak to select k. As a rule of thumb, a silhouette score above 0.5 indicates well-separated clusters; below 0.3 means your features are not discriminating enough. For DBSCAN: grid-search `epsilon` (the `eps` parameter in scikit-learn) and `min_samples`; start with `epsilon` derived from the k-nearest-neighbour distance plot.
- Validate with a GST domain expert. Before labelling any cluster as "high risk", pull five to ten GSTINs from each cluster and have a GST consultant review their actual return history. A cluster of GSTINs with high ITC ratios might be legitimate capital-goods manufacturers in a pre-production phase, a fact no algorithm can determine.
- Document methodology. If cluster outputs inform a show-cause notice or an internal audit report, the derivation must be traceable and withstand scrutiny by the taxpayer's counsel or an appellate authority. Undocumented model choices are the fastest way to invalidate a well-intentioned analytical exercise.
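The standardise-then-tune steps above can be sketched as follows, assuming scikit-learn and three synthetic behaviour groups. The group centres, spreads and the shortened k range are assumptions chosen to keep the example small; they are not drawn from real return data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Columns: turnover (Rs. crore), ITC/output ratio, IRN cancellation rate
traders = rng.normal([5, 0.7, 0.03], [1, 0.05, 0.01], size=(100, 3))
manufacturers = rng.normal([50, 0.9, 0.03], [5, 0.05, 0.01], size=(100, 3))
suspects = rng.normal([8, 4.5, 0.30], [2, 0.2, 0.03], size=(100, 3))
raw = np.vstack([traders, manufacturers, suspects])

# z-score every column so turnover does not dominate Euclidean distance
X = StandardScaler().fit_transform(raw)

# Scan a range of k and record the silhouette score for each
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("k with the highest silhouette:", best_k)
```

On real data the silhouette curve is rarely this clean, which is why the scan is read alongside the elbow plot and the domain review rather than on its own.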
Worked Example: K-Means Exposes a Fake-ITC Operator
Consider a textile trader GSTIN registered in Maharashtra in October 2023. By April 2026 (start of FY 2026-27), a K-Means exercise on the Maharashtra GST database places this GSTIN in Cluster 4 (the smallest risk cluster, representing 0.4% of active filers), characterised by extreme ITC/output ratios and high supplier concentration.
The feature profile that triggered the flag:
- Average monthly output tax payable: Rs. 14 lakhs
- Average monthly ITC claimed: Rs. 68 lakhs
- ITC/output ratio: 4.85x (Cluster 4 centroid: 4.2x; textile-trader sector benchmark: 0.65 to 0.80x)
- Top three supplier GSTINs contributing 91% of inbound ITC: all registered in Tamil Nadu in 2023, all cancelled by CBIC under Rule 22 within eight months of registration
- GSTR-2B-to-GSTR-3B ITC pass-through ratio: only 13% of claimed ITC appears in any live counterparty's GSTR-1; the remaining 87% is unsupported under Section 16(2)(aa) of the CGST Act
- Own IRN cancellation rate: 34% (sector average: 3%)
CBIC issues a show-cause notice under Section 74 of the CGST Act, 2017 (fraud or willful misstatement). The demand:
| Component | Amount |
|---|---|
| ITC disallowed ((Rs. 68L - Rs. 10.5L supportable) per month × 12 months) | Rs. 6.90 crore |
| Interest under Section 50(3) at 18% p.a. for 300 days | Rs. 1.02 crore |
| Penalty under Section 74 at 100% of tax evaded | Rs. 6.90 crore |
| Total demand | Rs. 14.82 crore |
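The arithmetic behind the table can be checked in a few lines. This sketch assumes simple (non-compounded) interest on a 365-day year, and treats the Rs. 10.5 lakh supportable ITC per month as given in the table.

```python
LAKH = 100_000
CRORE = 10_000_000

itc_claimed_per_month = 68 * LAKH
itc_supportable_per_month = 10.5 * LAKH
months = 12

# Disallowed ITC: unsupported credit per month, over twelve months
itc_disallowed = (itc_claimed_per_month - itc_supportable_per_month) * months

# Simple interest at 18% p.a. for 300 days (365-day year assumed)
interest = itc_disallowed * 0.18 * 300 / 365

# Section 74 penalty at 100% of the tax amount
penalty = 1.00 * itc_disallowed

total_demand = itc_disallowed + interest + penalty

for label, amount in [
    ("ITC disallowed", itc_disallowed),
    ("Interest", interest),
    ("Penalty", penalty),
    ("Total demand", total_demand),
]:
    print(f"{label}: Rs. {amount / CRORE:.2f} crore")
```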
Under Rule 86A of the CGST Rules, 2017, the commissioner can block the electronic credit ledger immediately upon recording reasons to believe, operationally freezing the business before the order is even passed.
The cluster flag was the trigger. In a database of several lakh active Maharashtra filers, this GSTIN had no individually obvious signature. It was the statistical combination (the ratio, the supplier concentration, the GSTR-2B gap, and the IRN cancellation rate together) that placed it unambiguously in the outlier cluster.
Worked Example: DBSCAN Unravels a Circular-Trading Ring
DBSCAN applied to Rajasthan's GST transaction network identified a dense sub-graph of eight GSTINs: four in Jaipur, two in Jodhpur, two in Kota. Each GSTIN supplied goods to the next, with GSTIN H ultimately supplying back to GSTIN A, completing the loop.
The ring's aggregate profile over six months:
- Total paper ITC generated across the eight entities: Rs. 4.8 crore
- Net GST deposited in cash by the entire ring: Rs. 3.2 lakhs; all other tax liabilities were settled using ITC generated within the ring itself
- Average e-way bill distance across all 140 transactions: 8.4 km (all eight GSTINs operate within a 12-km radius)
- Declared supply type on all invoices: interstate B2B goods supply
DBSCAN flagged all eight as a dense, isolated cluster separate from the broader Rajasthan taxpayer population. The mutual transaction intensity among these eight, each appearing as a dominant counterparty to the others, created exactly the pattern the algorithm is designed to detect: tight density within the cluster, surrounded by sparse space separating it from every other cluster.
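Once a dense group has been flagged, confirming that supplies actually loop back to their origin is a separate step. The sketch below is a plain directed-graph cycle check, not DBSCAN; it is offered as one possible follow-up. The labels A to H follow the example above, and the extra supplier X is hypothetical.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if there is none."""
    graph = {}
    for supplier, buyer in edges:
        graph.setdefault(supplier, []).append(buyer)

    path = []     # nodes on the current depth-first path, in order
    done = set()  # nodes fully explored without finding a cycle

    def dfs(node):
        if node in done:
            return None
        if node in path:
            # Back on the current path: the cycle starts at the first visit
            return path[path.index(node):]
        path.append(node)
        for nxt in graph.get(node, []):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for start in list(graph):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


# Supplier -> buyer edges for the ring, plus one unrelated supplier X
ring = list("ABCDEFGH")
edges = [(ring[i], ring[(i + 1) % 8]) for i in range(8)] + [("X", "A")]

cycle = find_cycle(edges)
print("cycle:", " -> ".join(cycle + cycle[:1]))
```

For a network with many GSTINs, a graph library would be the more practical choice; the point here is only that a closed loop is a checkable property of the invoice data.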
Under Section 122(1)(ii) of the CGST Act, knowingly issuing tax invoices without actual supply of goods or services attracts a penalty of Rs. 10,000 or the amount of tax evaded, whichever is higher, per transaction. With 140 transactions across the ring, the penalty exposure alone, before the principal demand and interest, is substantial.
Use Cases for Indian Businesses, Not Only Tax Departments
Cluster analysis is not exclusively a regulatory enforcement tool. Finance teams running multi-state operations can apply the same approach internally with significant returns:
Vendor ITC risk scoring. Group your active vendors by their IRN cancellation rate, GSTR-2B appearance consistency, and GST registration vintage. Vendors clustering in a high-risk profile should trigger enhanced due diligence before ITC is claimed under Section 16(2)(aa), which requires the credit to appear in GSTR-2B as a condition of eligibility. Claiming ITC from a vendor who subsequently gets cancelled retroactively creates a demand with interest and penalty that the business cannot recover from the supplier.
Branch GSTIN benchmarking. If your company holds 15 state GSTINs, clustering them by ITC utilisation rate, monthly turnover variance, and credit note frequency identifies branches that are statistical outliers within your own footprint. In practice, these outliers almost always point to process failures (unauthorised credit note issuance, duplicated invoices, or incorrect HSN classification) rather than deliberate fraud, but they carry the same regulatory exposure.
Export-refund pre-screening. LUT (Letter of Undertaking)-based zero-rated exporters can cluster their own monthly export supplies by e-way bill distance consistency, HSN stability, and bank-account details on refund applications. Anomalous months identified before submission allow corrections that avoid CBIC scrutiny during the refund adjudication stage, where a mismatch between GSTR-1 exports and shipping bills is common grounds for delay or rejection.
Treasury and working-capital planning. Clustering months by ITC accretion pattern, output tax liability volatility, and refund receipt timing produces a far more granular cash-flow forecast than simple averages. A month in Cluster A (high ITC accretion, moderate output) has a predictably different net GST cash impact than a month in Cluster C (low ITC, high output tax payable).
Common Mistakes That Invalidate a Clustering Exercise
Clustering raw invoice rows instead of GSTIN-level features. This clusters individual transactions rather than taxpayer behaviour. A single large taxpayer with thousands of invoices dominates all distance calculations.
Skipping normalisation. Running K-Means on unstandardised data means turnover in crores overwhelms the IRN cancellation rate, a number between 0 and 1. The result is clusters that simply separate large, medium and small taxpayers by absolute size, which is not useful.
Treating the choice of k as arbitrary. Running K-Means at k = 5 "because five feels like a reasonable number of risk bands" is the single most common methodological failure. Always compute silhouette scores across a range and let the data suggest the number of natural clusters.
Using point-in-time features only. A taxpayer who has consistently claimed a high ITC ratio for three years is a different risk than one who suddenly spiked in the last two months. Time-windowed rolling features (three-month, six-month, twelve-month) are essential to distinguish structural behaviour from recent anomalies.
No domain expert validation before action. A cluster of high-ITC-ratio GSTINs could be legitimate capital-equipment importers or diamond exporters where the sectoral ITC structure is genuinely unusual. No model can know this; a GST consultant reviewing five GSTINs from the cluster can determine it in 30 minutes.
Treating cluster membership as evidence in proceedings. Cluster membership is a hypothesis that justifies investigation, not a finding that supports assessment. Citing an algorithm's output in a show-cause notice without independent substantive evidence is a procedural vulnerability that any competent taxpayer's advocate will immediately exploit.
Governance Under the DPDP Act, 2023
The Digital Personal Data Protection Act, 2023 and its implementing rules classify GSTIN-linked information of proprietors and partners as personal data of identifiable individuals. Any clustering exercise must therefore observe the following before data enters the pipeline:
- Purpose limitation. Process GSTIN-level data only for the stated compliance objective. Repurposing cluster outputs for credit scoring, marketing, or inter-departmental sharing requires a fresh and independent lawful basis.
- Pseudonymisation before cloud processing. Replace GSTINs with randomised internal identifiers before loading data into any shared or cloud-based ML environment. Re-link to actual GSTINs only during the investigation stage, within a controlled access environment with documented authorisation.
- Access logging. Maintain an immutable audit trail of who ran which query, on what dataset, and when. This satisfies the DPDP obligation and protects the organisation if a cluster-output-informed decision is challenged in an appellate forum.
- Defined retention limits. Individual-level cluster scores should not be retained beyond the compliance purpose cycle. Aggregate cluster statistics and model artefacts can be retained longer, but with clear version control.
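The pseudonymisation step can be sketched with the standard library alone. The `TP-` prefix, the function name and the sample GSTINs below are hypothetical (the GSTINs are made-up strings, not real registrations); the key assumption is that the returned mapping is stored only inside the controlled environment used for re-linking.

```python
import secrets


def pseudonymise(gstins):
    """Map each distinct GSTIN to a random internal ID.

    The IDs are random, not derived from the GSTIN, so they cannot be
    reversed without the mapping itself.
    """
    mapping = {}
    used = set()
    for gstin in gstins:
        if gstin in mapping:
            continue
        token = "TP-" + secrets.token_hex(8)
        while token in used:  # regenerate on the (very unlikely) collision
            token = "TP-" + secrets.token_hex(8)
        used.add(token)
        mapping[gstin] = token
    return mapping


# Made-up GSTIN-shaped strings, with one repeat
rows = ["27AAAAA0000A1Z5", "08BBBBB1111B1Z3", "27AAAAA0000A1Z5"]
mapping = pseudonymise(rows)

# Only the pseudonymised column leaves the controlled environment
pseudonymised_rows = [mapping[g] for g in rows]
print(pseudonymised_rows)
```

Random tokens are used here rather than a hash of the GSTIN because GSTINs follow a known format, so an unkeyed hash could be reversed by enumeration.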
For regulated entities sharing data upstream with CBIC under data-exchange agreements, each sharing arrangement must document its own legal basis under the DPDP Act independently; the tax-administration purpose of one department does not automatically extend to an analytics function in another.
Operationalising Cluster Outputs: From Model to Action
A clustering model that produces a spreadsheet of risk labels and is filed away is a waste of analytical effort. Every cluster must map to an explicit, time-bound workflow:
Low-risk cluster: Routine quarterly monitoring. No immediate action. Flag any migration out of this cluster in the next run as a priority review item.
Medium-risk cluster: Trigger a GSTR-2B reconciliation review within 30 days. Check e-way bill consistency for the preceding six months. No external action without additional corroborating evidence.
High-risk cluster: Initiate internal investigation immediately. Freeze ITC claims from flagged counterparty vendors pending verification, while tracking the Section 16(4) time-bar on ITC claims (the earlier of 30 November following the financial year of the invoice, or the date of filing the annual return). Prepare a defensible computation memo.
Outlier or fraud-signal cluster: For tax departments: initiate Rule 86A blocking proceedings. For corporates: escalate to CFO level and evaluate voluntary disclosure. Paying the tax with interest before a show-cause notice is issued under Section 74 reduces the penalty from 100% to 15% of the tax amount, a significant difference on any material demand.
Track cluster migration over time. A GSTIN moving from Cluster 1 to Cluster 3 in a single quarter is a materially stronger signal than static cluster membership. Build a quarter-on-quarter migration tracker (a Sankey diagram in Power BI works well) so the monitoring function surfaces changers, not just current members.
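The data behind such a tracker is a simple transition count. The sketch below assumes cluster IDs mean the same thing in both quarters, for example because one fitted model is applied to each quarter's data; labels from two independent K-Means fits are arbitrary and would need aligning first. The GSTIN keys and cluster assignments are hypothetical.

```python
from collections import Counter

# Hypothetical cluster assignments for two consecutive quarters
previous_quarter = {"G1": 1, "G2": 1, "G3": 2, "G4": 3}
current_quarter = {"G1": 1, "G2": 3, "G3": 2, "G4": 3, "G5": 2}

# Only GSTINs present in both runs can migrate
common = previous_quarter.keys() & current_quarter.keys()

# (from_cluster, to_cluster) -> number of GSTINs: the flows of a Sankey diagram
transitions = Counter(
    (previous_quarter[g], current_quarter[g]) for g in common
)

# GSTINs whose cluster changed, for priority review
movers = sorted(g for g in common if previous_quarter[g] != current_quarter[g])

# New registrants have no history and are tracked separately
new_entrants = sorted(current_quarter.keys() - previous_quarter.keys())

for (src, dst), count in sorted(transitions.items()):
    print(f"Cluster {src} -> Cluster {dst}: {count}")
print("movers:", movers, "| new entrants:", new_entrants)
```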
Key Takeaways
- Clustering discovers what rules cannot. Because it requires no pre-labelled fraud examples, it surfaces novel evasion structures (new forms of circular trading, split-invoice schemes, IRN manipulation) as soon as they emerge in the data.
- Feature engineering is 80% of the result. Twenty well-engineered ratios (ITC/output, GSTR-2B pass-through rate, e-way bill distance consistency, supplier concentration) consistently outperform hundreds of raw rupee columns in practice.
- Match the algorithm to the objective. Use K-Means for broad taxpayer risk banding; use DBSCAN specifically for isolating circular-trading rings and extreme behavioural outliers.
- The penalty arithmetic is asymmetric. A Section 74 demand on wrongly availed ITC attracts 100% penalty plus interest: as the Maharashtra example above shows, a Rs. 6.90 crore tax demand becomes a Rs. 14.82 crore total exposure. Internal detection via clustering is orders of magnitude cheaper.
- Domain expertise is a mandatory input, not an optional add-on. Every cluster must be reviewed by a GST practitioner before any action is taken; a model cannot distinguish a capital-goods manufacturer from a fake-ITC operator by feature values alone.
- DPDP Act compliance applies to every clustering exercise. Pseudonymise identifiers before cloud processing, log every access, define retention limits, and document the lawful basis before the first query is run.
- Operationalise or do not start. If cluster outputs do not feed a defined, time-bound workflow (a vendor watchlist, a refund pre-screen, an internal audit trigger), the exercise consumes analytical resource without producing compliance value. Map every cluster to an explicit next action before the model goes live.





