# Data Inventory — Approved External Sources

This is the running log of every external dataset approved by the user under the [data-approval-protocol](../.claude/rules/data-approval-protocol.md). The `data-editor` agent reads this file to verify that fetches reference an approved entry.

## How approvals are recorded

Each approved fetch adds one row below. The `Approval ref` is the identifier that the corresponding provenance YAML cites in its `approval_reference` field.

| Approval ref | Date | Provider | Tier | Indicator | Countries | Years | Approved by | Audit score | User sign-off |
|--------------|------|----------|------|-----------|-----------|-------|-------------|-------------|---------------|
| P1 | 2026-04-27 | WB WDI (via R `wbstats`) | 1 | `GC.TAX.TOTL.GD.ZS` (Tax revenue, % of GDP) | 17 EAP | 2004–2024 | user (chat) | 97/100 | ✅ with caveats |
| P2 | 2026-04-27 | WB WDI (via R `wbstats`) | 1 | `NY.GDP.MKTP.CD`, `NY.GDP.PCAP.CD` (GDP, GDP per capita) | 17 EAP | 2004–2024 | user (chat) | 97/100 | ✅ with caveats |
| P3 | 2026-04-27 | WB WDI (via R `wbstats`) | 1 | `GC.REV.XGRT.GD.ZS` (Government revenue ex-grants, % of GDP) | 17 EAP | 2004–2024 | user (chat) | 88/100 | ✅ with caveats |
| P4 | 2026-04-27 | WB Data360 (IMF WoRLD mirror, via R `httr2`) | 1 | WoRLD 14 indicators (Total Revenue, Tax Total, CIT, PIT, VAT, Excise, Trade, Property, Payroll, Soc. Contrib., Grants, etc.) | 17 EAP | 2004–2024 (effective 2004–2021) | user (chat) | 86/100 | ✅ with caveats |
| P5 | 2026-04-27 | WB WDI (via R `wbstats`) | 1 | `FP.CPI.TOTL.ZG`, `PA.NUS.FCRF` (inflation, exchange rate) | 17 EAP | 2004–2024 | user (chat) | 96/100 | ✅ with caveats |
| P6 | 2026-04-27 | IMF GFS | 2 | _conditional — only if Tier-1 has gaps; requires fresh per-fetch approval_ | _TBD_ | _TBD_ | user (chat) | _conditional_ | _conditional_ |
| P7 | 2026-04-27 | OECD Revenue Stats Asia-Pacific | 2 | _conditional — only if Tier-1 has gaps; requires fresh per-fetch approval_ | _TBD_ | _TBD_ | user (chat) | _conditional_ | _conditional_ |
| P8 | 2026-04-27 | ADB Key Indicators (manual) | 2 | _conditional — Pacific gaps only; manual download_ | _TBD_ | _TBD_ | user (chat) | _conditional_ | _conditional_ |
| P10 | 2026-04-27 | WB Data360 (IMF WoRLD mirror, via R `httr2`) | 1 | Donor pool synthetic control — IMF WoRLD 10 indicators (Total Rev, Tax Total, Income, CIT, PIT, G&S, General G&S, VAT, Excise, Payroll) | Non-EAP LMIC+UMIC, 83 countries (25 SSA, 22 LAC, 18 ECA, 12 MENA, 6 SA) | 2002–2024 | user (chat) | 90/100 | ✅ with caveats |
| P10-WDI-macro | 2026-04-27 | WDI (via R `wbstats`) | 1 | Donor pool macro — GDP, GDP per capita, inflation, tax/GDP | Non-EAP LMIC+UMIC, 83 countries | 2002–2024 | user (chat) | **failed** — WB API timeouts | **deferred** — `tidysynth` matches on lagged outcomes only, macro nice-to-have |
| P13 | 2026-04-28 | WID (via R `wid` package; **Tier-1-direct**) | 1 | Inequality — top 10% / top 1% / bottom 50% pre-tax national income shares + Gini (4 indicators). **Factor shares dropped from scope** (re-probe of WID exhausted; `wpllin`, `wpkkin`, `wccinc`, `wllinc`, `wkkinc`, `wcomes`, `wgsurp`, `wnninc`, `mcomes`, `mgsurp` etc. all returned no data). | 17 EAP (effective 10 country-specific + Pacific 7 dropped) + 83 donor pool (82 effective; Kosovo XKX unmapped) | 1995–2023 | user (chat) | 92/100 (post-dedup) | ✅ with caveats |
| P13-ext | 2026-04-28 | WID (via R `wid` package) | 1 | P13 extension — average pre-tax income (`aptinc`) for Bottom 50% / Top 10% / Top 1% / total + PPP USD exchange rate (`xlcusp`) for income-level evolution chart. | EAP-10 only | 1995–2023 | user (chat, image-based spec) | _within P13_ | ✅ |
| P13-ext-2 | 2026-04-28 | WID (via R `wid` package) | 1 | P13 extension #2 — average disposable income (`adiinc992j`, post-direct-tax + monetary-transfer) for Bottom 50% / P50–P90 / Top 10% / Top 1% / total. WID-native pre-tax vs post-tax cross-check input for Segment 3 Part 2 (SPZ-stylised post-tax simulation). | EAP-10 only | 1995–2023 | user (chat, plan 2026-04-28) | _within P13_ | ✅ |
| P13-ext-3 | 2026-04-28 | WID (via R `wid` package) | 1 | P13 extension #3 — `aptinc992j` and `adiinc992j` at full deciles (D1..D10) plus Top 1% (`p99p100`). Used by 06_post_tax.R for decile-resolution simulation and by 07_figures_part2.R for the per-country percentile-curve figure. | EAP-9 only (KHM dropped) | 1995–2023 | user (chat, 2026-04-28) | _within P13_ | ✅ |
| P14 | 2026-04-28 | WB PIP (Poverty and Inequality Platform, via R `httr2` against `api.worldbank.org/pip/v1/`) | 1 | Consumption (or income) decile shares (decile1..decile10) + Gini + welfare type + survey year. Input for VAT/excise/trade incidence allocation in the SPZ-stylised post-tax simulation (Part 2). Bachas-Gadenne-Jensen 2024 was rejected as primary source after coverage check showed only PNG of 10 EAP countries in their sample. | **EAP-9 (KHM dropped — no PIP coverage)**: IDN, PHL, VNM, THA, MYS, MMR, LAO, MNG, PNG | latest survey per country (2009 PNG → 2025 IDN) | user (chat, plan 2026-04-28) | basic-stats user-signed-off (2026-04-28) | ✅ with caveats (KHM dropped; MYS income welfare; MMR 2017 / PNG 2009 stale) |
| P11 | 2026-04-29 | WB WDI (via R `wbstats`, **Tier-1-direct**) | 1 | Segment 4 structural controls — EAP-17. `NY.GDP.PCAP.PP.KD` (GDP per capita PPP), `SP.URB.TOTL.IN.ZS` (urbanisation), `NE.TRD.GNFS.ZS` (trade openness), `NV.AGR.TOTL.ZS` (agriculture share), `NY.GDP.TOTL.RT.ZS` (resource rents — annotation only). | EAP-17 | 2004–2024 | user (chat, plan 2026-04-29) | basic-stats user-signed-off (2026-04-29) | ✅ |
| P12 | 2026-04-29 | WB WDI (via R `wbstats`, **Tier-1-direct**) | 1 | Same indicators as P11, donor pool. | 76 donor pool (matches Segment 3 cleaned donor list) | 2004–2024 | user (chat, plan 2026-04-29) | basic-stats user-signed-off (2026-04-29) | ✅ |
| P15 | 2026-04-29 | ILOSTAT via R `Rilostat` package (Tier-2 with WB-coverage justification) | 2 | Informal employment as % of total employment (`EMP_NIFL_SEX_RT_A`, sex=SEX_T) for the TSRS in Segment 4 Part 1. WB-coverage justification: WDI does not publish the ILO informal-employment series at the EAP coverage level needed. | EAP-17 (effective 13; PHL/MYS/PNG/SLB missing → 5-component TSRS for those 4) + 76 donor pool | 2010–2024 | user (chat, plan 2026-04-29) | basic-stats user-signed-off (2026-04-29) | ✅ |
| P16 | 2026-04-29 | WB WDI (via R `wbstats`, **Tier-1-direct**) | 1 | Financial-inclusion proxy — ATMs per 100,000 adults (`FB.ATM.TOTL.P5`). Replaces Findex (`FX.OWN.TOTL.ZS`) which only covers 9/17 EAP. | EAP-17 (effective 16 — TUV missing) + 76 donor pool | 2004–2024 | user (chat, plan 2026-04-29) | basic-stats user-signed-off (2026-04-29) | ✅ |
| P17 | 2026-04-29 | IMF WEO via DataMapper API (R `httr2` against `www.imf.org/external/datamapper/api/v1/`, Tier-2 with WB-coverage justification) | 2 | 2030 GDP projections — `NGDPD` (nominal GDP USD), `NGDPDPC` (per capita), `NGDP_RPCH` (real growth %), `PPPGDP` (PPP USD), `LP` (population). WB-coverage justification: WDI does not publish forward projections beyond the current year; IMF WEO is the standard cross-country source. | EAP-17 + 76 donor pool | 2024–2030 | user (chat, plan 2026-04-29) | basic-stats user-signed-off (2026-04-29) | ✅ |
| P18 | 2026-04-30 | **User-supplied** — internal WB engagement pipeline note (email from user, 2026-04-30) | 1 | Current 2026 WB tax engagements per EAP country, structured by pillar (Closing Policy Gap / Closing Compliance Gap / Reducing Cost of Compliance), work type (Lending / ASA), activity description, and funding source. 55 distinct activities across 11 EAP countries (LAO, KHM, VNM, IDN, PHL, PNG, MYS, MNG, TON, MHL, VUT) plus one regional PEMNA item. | EAP-11 + Regional | 2026 (snapshot) | user (chat, 2026-04-30) | _user-supplied; no API audit_ | ✅ |
| P19 | 2026-05-01 | **User-supplied** — WB OPCS DPAD database (FY24 vintage), uploaded by user to repo (commit `e4b6e5e`) | 1 | All Development Policy Operation prior actions across the six WB regions, FY2004–FY2024. 11,628 prior actions; 935 tax-tagged after filtering on Theme codes {111 Fiscal Sustainability, 114 Tax Policy, 115 Subnational Fiscal Policies, 412 Domestic Revenue Administration}. Used to construct the cross-region tax DPO comparison in Chapter 1 of the cross-segment note. | All WB regions (AFR, EAP, ECA, LCR, MNA, SAR) — 160 unique countries with at least one DPO prior action in the window | FY2004–FY2024 | user (chat + commit, 2026-05-01) | _user-supplied; no API audit_ | ✅ |
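
The check that the `data-editor` agent runs against this table can be sketched roughly as follows. This is a minimal Python illustration only (the project's own scripts are in R). The function names and the assumption that the approval ref is always the first cell of a row are hypothetical; only the `approval_reference` field name comes from this document.

```python
def approved_refs(inventory_markdown):
    """Collect first-column values from the approvals table (assumed layout)."""
    refs = set()
    in_table = False
    for line in inventory_markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("| Approval ref"):
            in_table = True  # header row found; data rows follow
            continue
        if in_table:
            if not stripped.startswith("|"):
                break  # first non-table line ends the approvals table
            first = stripped.strip("|").split("|")[0].strip()
            # Skip the |----| separator row and any empty cell.
            if first and not set(first) <= set("-"):
                refs.add(first)
    return refs


def is_approved(approval_reference, inventory_markdown):
    """True if a provenance file's approval_reference matches a table row."""
    return approval_reference in approved_refs(inventory_markdown)


# Abbreviated copy of the table above, for illustration.
sample = """| Approval ref | Date | Provider |
|--------------|------|----------|
| P1 | 2026-04-27 | WB WDI |
| P13-ext | 2026-04-28 | WID |
"""
print(is_approved("P13-ext", sample))  # True
print(is_approved("P9", sample))       # False (P9 appears only under rejected proposals)
```

An exact string match is used deliberately, so that `P13` and `P13-ext` stay distinct entries.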

## Caveats noted at sign-off

### Synthetic control analysis — shelved 2026-04-27

**Decision:** the synthetic-control event study (DPO + revenue-raising prior actions on tax/GDP via `tidysynth`, donors = non-EAP LMIC+UMIC) is **shelved as exploratory** pending better data. Only 7 of 15 treated cells fit, post-treatment windows are short (Data360 vintage cliff at 2020), and most cells show effects indistinguishable from placebo permutations. Code and outputs are preserved in:
- `segment_2_taxation_trends/src/R/04_event_study_diagnostics.R`
- `segment_2_taxation_trends/src/R/04_synthetic_control.R`
- `segment_2_taxation_trends/src/R/07_synthetic_control_figures.R`
- `output/rds/synth_results.rds`
- `output/figures/fig_s2_4_synth_*.png`, `fig_s2_6_synth_*.png`, `fig_s2_7_synth_*.png`

**Canonical Segment 2 analysis** uses the direction-aware alignment heatmaps (Fig S2-4, S2-4a, S2-4b) plus the four descriptive figures (S2-1a/b, S2-2, S2-3, S2-5). The shelved synthetic-control analysis may be revisited when fresher IMF data are available.

### P13 — WID inequality, audited 2026-04-28 (signed off with caveats)

**Status note (must be propagated to every downstream agent / figure / table / report):** WID provides only a **Pacific regional aggregate** for 7 of the 17 EAP economies — Fiji, Kiribati, Solomon Islands, Tonga, Tuvalu, Vanuatu, Samoa. The same 1995–2023 inequality time series is repeated under each Pacific country's label. **These 7 countries are dropped from country-level inequality analysis.** Effective country-specific EAP sample for Segment 3 = 10 (Indonesia, Philippines, Vietnam, Thailand, Malaysia, Myanmar, Cambodia, Lao PDR, Mongolia, Papua New Guinea).

**THIS IS A KNOWN GAP REQUIRING FUTURE INVESTIGATION.** Country-level inequality data for the Pacific 7 must be sought from alternative sources (e.g., national statistical offices, ADB, World Bank LSMS surveys, Pacific Data Hub) in a future segment or follow-on engagement. Until then, any inequality finding for the Pacific 7 is regional, not country-specific.

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| **Pacific 7 regional aggregate** | WID publishes one Pacific-7 series, not country-specific | **Drop Pacific 7 from country-level inequality analysis. Flag as a gap to be revisited with non-WID Pacific data.** |
| Donor pool clusters with identical series | LAC residual 9 (BLZ, BOL, GTM, HND, HTI, JAM, NIC, PRY, SUR), Eastern Caribbean 4 (DMA, GRD, LCA, VCT) — same WID regional aggregation | Treat as single regional observations or drop in country-level analyses. |
| Factor shares (labour/capital split) — DROPPED | WID re-probe exhausted (`wpllin`, `wpkkin`, `wccinc`, `wllinc`, `wkkinc`, `wcomes`, `wgsurp`, `wnninc`, `mcomes`, `mgsurp`, etc. all returned no data through the `wid` R package) | **Removed from Segment 3 scope.** Personal income inequality (4 indicators) only. Factor-share angle could be revisited later via Penn World Tables `labsh`. |
| 100% cell coverage in country-specific subset | WID interpolates / extrapolates to fill panel | Document; do not treat year-on-year wiggles as identifying variation. |
| Donor pool count: 83 fetched (82 effective; Kosovo XKX unmapped) | Earlier doc said 76 — that was the post-TJK/CUB/4-missing count for IMF WoRLD donors | Inventory updated. WID donors = 82 effective. |
| Row-doubling on initial fetch | Some indicator/percentile combinations returned duplicate rows | Fixed in `fetch_wid_inequality.R` with `distinct()`. CSV is now 11,484 rows. |
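
The regional-aggregate problem noted above (one series repeated under several country labels) can be detected mechanically. The following is a minimal Python sketch, not the logic of `fetch_wid_inequality.R`. It assumes long-format rows of `(country, indicator, year, value)`; the function name, the row layout and the toy values are all hypothetical.

```python
from collections import defaultdict


def identical_series_clusters(rows):
    """Group countries whose full (indicator, year, value) series coincide.

    rows: iterable of (country, indicator, year, value) tuples.
    Returns a sorted list of sorted country lists, one per cluster of size > 1.
    """
    unique_rows = set(rows)  # drops exact duplicate rows, similar to distinct()
    series = defaultdict(set)
    for country, indicator, year, value in unique_rows:
        series[country].add((indicator, year, value))
    by_signature = defaultdict(list)
    for country, points in series.items():
        by_signature[frozenset(points)].append(country)
    return sorted(sorted(c) for c in by_signature.values() if len(c) > 1)


# Toy values, made up for illustration (not WID data).
toy = [
    ("FJI", "top10", 2000, 0.45), ("FJI", "top10", 2001, 0.46),
    ("TON", "top10", 2000, 0.45), ("TON", "top10", 2001, 0.46),
    ("TON", "top10", 2001, 0.46),  # duplicated row, as in the initial fetch
    ("IDN", "top10", 2000, 0.41), ("IDN", "top10", 2001, 0.42),
]
print(identical_series_clusters(toy))  # [['FJI', 'TON']]
```

Exact equality on values is intentional here: a repeated regional aggregate is byte-identical across labels, whereas two genuinely country-specific series would differ somewhere.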

### P10 — IMF WoRLD donor pool, audited 2026-04-27 (signed off with caveats)

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| Vintage cliff at 2020 | Data360 mirror returns almost no data after 2020 (1 cell at 2021). A direct pull from IMF SDMX 3.0 was attempted, but the endpoint structure could not be reverse-engineered without the `imfp` package (not available for R 4.5). | **5 treated cells dropped** as unviable (Indonesia × Excise 2020, Indonesia × PIT 2022, Fiji × VAT 2022, Philippines × Excise 2019, PNG × CIT 2019). 10 cells (4 strict + 6 relaxed-match) retained. |
| Tajikistan only 2/19 years coverage | Insufficient as a donor | **Dropped from donor pool** |
| Cuba excluded | Not comparable economy | **Dropped from donor pool** per user direction |
| Algeria, Turkmenistan, Kosovo, Zimbabwe absent from Data360 | Country-specific coverage gaps | Accepted attrition |
| Effective donor pool size: 76 countries | 83 fetched − 1 TJK − 1 CUB − 5 missing | Sufficient for tidysynth synthetic control |
| Hydrocarbon/SACU/CBI economies (LBY, IRQ, LSO, DMA, MNE) show outlier total revenue >50% of GDP | Real economic features (oil, customs, citizenship-by-investment) | Accept; tidysynth weights will pick more representative donors |

### P5 — WDI inflation and exchange rate, audited 2026-04-27 (signed off with caveats)

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| Lao PDR 2022–2024 inflation 23–31% | Real macro episode (post-COVID kip depreciation pass-through) | Accept as-is; annotate |
| Myanmar 2011→2012 FX +11,668% (5.44 → 640.65 MMK/USD) | Official currency reform — peg removed | **Add structural-break dummy at MMR-2012** in econometric specifications |
| Tuvalu: 1/21 inflation obs, FX = AUD | Tuvalu uses Australian dollar; WDI does not track separately | **Exclude TUV** from any specification using inflation or FX |
| Myanmar truncated 2019 (CPI) / 2020 (FX) | Same publication-lag ceiling as other indicators | Same handling as elsewhere |
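
The two mitigations that affect model specifications (the MMR-2012 break dummy and the TUV exclusion) might be applied along these lines. This Python sketch assumes a step dummy that equals 1 for Myanmar from 2012 onward; a one-year pulse dummy is another possible reading of the caveat. The record layout (`iso3`, `year`) and the column name `mmr_fx_break` are hypothetical.

```python
def prepare_fx_panel(records, break_country="MMR", break_year=2012,
                     exclude=("TUV",)):
    """Drop excluded countries and add a step dummy for the FX regime break.

    records: list of dicts with at least 'iso3' and 'year' keys (assumed layout).
    """
    out = []
    for rec in records:
        if rec["iso3"] in exclude:
            continue  # TUV has no usable inflation/FX series
        dummy = int(rec["iso3"] == break_country and rec["year"] >= break_year)
        out.append({**rec, "mmr_fx_break": dummy})
    return out


panel = [
    {"iso3": "MMR", "year": 2011},
    {"iso3": "MMR", "year": 2012},
    {"iso3": "TUV", "year": 2012},
    {"iso3": "LAO", "year": 2012},
]
for row in prepare_fx_panel(panel):
    print(row)
```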

### P4 — IMF WoRLD via Data360, audited 2026-04-27 (signed off with caveats)

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| Kiribati / Tuvalu Total Revenue >50% of GDP | Same Pacific pattern as P3 (fishing licences, trust funds) | Accept as substantive feature |
| WDI vs WoRLD \|Δ\| > 2 pp on 31 country-year cells | Methodology + source mismatch | **Default to WDI** (P1) to preserve WB priority. Use WoRLD only where WDI has gaps. |
| Composition arithmetic: RTGS ≠ RTGSG + RTGSE; RT ≠ Σ components | "Other taxes" residual not split out in WoRLD | Document and proceed (`02_clean.R` will compute residual = RT − Σ named components) |
| MMR, TON, TUV, VUT have only headline RT (no CIT/PIT/VAT/Excise breakdown) in WoRLD | Source uses IMF WEO methodology for these countries; WEO only publishes headline | **Escalate to P6 (IMF GFS) as a fresh Tier-2 request** for these four countries' composition. P6 is approved at intent level only and still requires fresh per-fetch approval. |
| Vietnam 2012 −5.7 pp YoY on RT | Real CIT/VAT reform | Annotate rather than smooth |
| WoRLD coverage stops 2019–2021 (no 2022–2024 despite vintage notes) | Confirmed empirical limit of current WoRLD release | Do not pursue 2022–2024 from WoRLD; either accept gap or seek from P6/P7 |
| Coverage gap fill: IDN 3→18 yrs, VNM 0→17 yrs, TUV 0→15 yrs, MMR/PNG extended | Substantial gain over WDI alone — primary justification for P4 | Headline RT now available 2004–2021 for ~all 17 countries |
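
Two of the rules above lend themselves to a short illustration: the WDI-first source priority and the "other taxes" residual. The following Python sketch is not the content of `02_clean.R`. The function names are hypothetical, and returning a missing residual whenever any input is missing is an assumption made here rather than a documented rule.

```python
def prefer_wdi(wdi_value, world_value):
    """Return (value, source): WDI when present, otherwise WoRLD."""
    if wdi_value is not None:
        return wdi_value, "WDI"
    if world_value is not None:
        return world_value, "WoRLD"
    return None, None


def other_taxes_residual(total_rt, named_components):
    """Residual = RT minus the sum of named components.

    Returns None if RT or any component is missing (assumed convention),
    because a partial sum would overstate the residual.
    """
    if total_rt is None or any(c is None for c in named_components):
        return None
    return total_rt - sum(named_components)


# Made-up values in % of GDP, for illustration only.
print(prefer_wdi(None, 12.4))                          # (12.4, 'WoRLD')
print(prefer_wdi(11.9, 12.4))                          # (11.9, 'WDI')
print(other_taxes_residual(15.0, [5.0, 4.0, 3.5]))     # 2.5
```

Carrying the source label alongside the value keeps it possible to flag in figures which cells come from the fallback source.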

### P3 — WDI Revenue ex-grants, audited 2026-04-27 (signed off with caveats)

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| Kiribati > 100% of GDP (max 107.6 in 2015) | Substantive — fishing licence fees + RERF interest, well-documented Pacific pattern | Keep raw; cross-check against IMF WoRLD (P4) before final use |
| Lao PDR 2006/2007 ≈ 2.5e-11 (placeholder) | WDI placeholder for missing data | Coerce to NA in `02_clean.R`; cross-check with IMF WoRLD (P4) |
| Myanmar 2018 drop to 7.01% | Classification break at fiscal-year shift | Accept and note in figures |
| Vietnam, Tuvalu missing | Same gap as P1 | Confirmed Tier-1 fallback to P4 (IMF WoRLD) |
| Indonesia stops 2009 (14% coverage) | Same WDI gap as P1 | Tier-1 fallback to P4 |
| Mongolia non-tax gap ~12 pp | Mining royalties + SOE dividends — user-confirmed plausible | No mitigation needed |
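
The placeholder coercion noted for Lao PDR could look roughly like this. It is a Python sketch of the idea, not the code in `02_clean.R`. The tolerance `1e-9` is a hypothetical choice; any cutoff well above 2.5e-11 and well below plausible revenue ratios would behave the same on the values described above.

```python
def coerce_placeholders(values, tol=1e-9):
    """Replace near-zero placeholder values with None (missing).

    Note: a genuine zero is also coerced, which is assumed acceptable
    for a revenue-to-GDP ratio.
    """
    return [None if v is not None and abs(v) < tol else v for v in values]


# 2.5e-11 is the placeholder magnitude reported above; 13.2 is made up.
print(coerce_placeholders([2.5e-11, 13.2, None]))  # [None, 13.2, None]
```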

### P2 — WDI GDP and GDP per capita, audited 2026-04-27 (signed off with caveats)

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| Tonga and Tuvalu latest year = 2023 | Both indicators have publication lag for these countries | Use 2023 as latest where applicable; flag in figures |
| Fiji 2020 nominal GDP −20.9% | Real macro shock (COVID + tourism collapse) | Annotate; consistent with P1 finding |
| Myanmar 2008 +57.9% | FX/peg revaluation | Use cautiously; constant US$ alternative may be needed for cross-time comparison |
| Mongolia 2010 +56.8% | Commodity recovery from GFC | Real macro event |
| PNG 2006 +71.7% | LNG investment cycle / kina FX | Real macro event |

### P1 — WDI tax revenue (% of GDP), audited 2026-04-27 (signed off with caveats)

| Caveat | Detail | Mitigation |
|--------|--------|------------|
| Indonesia data stops 2009 | WDI series for `GC.TAX.TOTL.GD.ZS` ends in 2009 for IDN | Pick up post-2009 from P4 (IMF WoRLD via Data360) |
| Myanmar data stops 2019 | Coup-related publication gap | P4 fallback; document in final report |
| Vietnam, Tuvalu absent | Neither country has any observations in this WDI indicator | P4 fallback |
| PNG coverage 47.6% | Below protocol's 50% threshold | P4 fallback; flag in final report |
| Fiji 2020 −5.23 pp | Real macro shock (COVID + tourism collapse) | Annotate in figures |
| Mongolia 2007 −5.23 pp; 2009 −5.11 pp | Commodity cycle / GFC | Annotate in figures |
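
The coverage figures in this table are shares of non-missing years in the approved 2004–2024 window (21 years); 47.6% is consistent with 10 observed years out of 21. A minimal Python sketch of the threshold check follows. The function names and the dict-of-years input are hypothetical; the 50% threshold is the one cited above.

```python
def coverage_share(observations, start=2004, end=2024):
    """Share of years in [start, end] with a non-missing value.

    observations: dict mapping year -> value (None or absent means missing).
    """
    years = range(start, end + 1)
    n_obs = sum(1 for y in years if observations.get(y) is not None)
    return n_obs / len(years)


def needs_fallback(observations, threshold=0.5, start=2004, end=2024):
    """True when coverage falls strictly below the protocol threshold."""
    return coverage_share(observations, start, end) < threshold


# Ten observed years with made-up values, the rest missing.
ten_years = {year: 14.0 for year in range(2004, 2014)}
print(round(100 * coverage_share(ten_years), 1))  # 47.6
print(needs_fallback(ten_years))                  # True
```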

## Rejected proposals

Items the user reviewed in `data_inventory_proposal.md` but chose not to use are listed here with the rejection reason.

| Date | Source | Reason |
|------|--------|--------|
| 2026-04-27 | P9 — FRED | User judgment: not relevant for an EAP-focused study (FRED has limited EAP coverage) |
