sejm2git
Turn Polish legislation into a living git repository of consolidated law.
The Sejm publishes every act through the official API at api.sejm.gov.pl (the ELI
service for published acts, and the Sejm service for the legislative process). sejm2git
downloads acts, parses them into a structured document model, applies amendments
("nowelizacje") to reconstruct the currently-in-force text, and materializes the
result into git — where each law is one Markdown file whose git log -p is its
amendment history, and every amending act is a branch merged on enactment.
Status
Working end-to-end prototype. Proven on the ustawa o ograniczeniu handlu w niedziele i
święta (Dz.U. 2018 poz. 305): its full chain of 5 amendments (2018–2024) consolidates
at 100% — 20/20 change instructions applied — and replays into a branch-per-amendment,
merge-on-enactment git history. Validated against the official tekst jednolity (the
authoritative PDF, DU/2025/301): the reconstructed substantive text is 97.2% verbatim-contained
in the official consolidation, the remainder fully explained (the official omits spent
consequential-amendment provisions and refreshes Dz.U. citations) — no substantive discrepancy.
Across the complete 2016 cohort — all 35 statutes of the year with a published
consolidated text, including chains of up to 90 amendments — the mean substantive
recall is 99.0% (32 MATCH / 2 CLOSE / 1 DIVERGENT), with zero unrecognized
instruction forms left in the corpus; see
docs/fidelity-dashboard.md. The dashboard doubles as a
regression gate: dashboard --fail-under 97 exits non-zero when mean recall drops below 97%.
* Wejście w życie: Dz.U. 2024 poz. 1965 (2025-02-01)
|\
| * Dz.U. 2024 poz. 1965: Ustawa ... o zmianie ustawy o dniach wolnych od pracy ...
|/
* Wejście w życie: Dz.U. 2023 poz. 2626 (2023-12-05)
|\ ...
* Dz.U. 2018 poz. 305: Ustawa ... o ograniczeniu handlu w niedziele i święta ...
* Initialize legal corpus
git show of an amendment commit is the legal change as a clean diff:
**Art. 7.**
1. Zakaz, o którym mowa w art. 5, nie obowiązuje w:
- 1) kolejne dwie niedziele poprzedzające pierwszy dzień Bożego Narodzenia;
+ 1) kolejne trzy niedziele poprzedzające Wigilię Bożego Narodzenia;
...
-1a. Jeżeli niedziela poprzedzająca pierwszy dzień Bożego Narodzenia przypada ...
+1a. (uchylony)
+1b. W przypadku, o którym mowa w ust. 1 pkt 1, pracownik lub zatrudniony może ...
Install & run
pip install -e . # core; or: pip install requests lxml
pip install -e '.[validation]' # + pdfplumber, for the `validate` command
# 1. parse a single act as enacted -> Markdown
python -m sejm2git.cli parse DU/2018/305
# 2. reconstruct the in-force text via the whole amendment chain (+ a coverage report)
python -m sejm2git.cli consolidate DU/2018/305 -o handel.md
# 2b. check the reconstruction against the official consolidated PDF (tekst jednolity)
python -m sejm2git.cli validate DU/2018/305
# -> [MATCH] substantive recall (reconstruction ⊆ official): 97.2%
# 2c. corpus-wide fidelity dashboard (discovers consolidated statutes, flags coverage gaps);
# --fail-under makes it a CI/regression gate for the scheduled sync
python -m sejm2git.cli dashboard --years 2016 --limit 35 --fail-under 97 -o docs/fidelity-dashboard.md
# 3. materialize the law's enactment + amendment history into a git repo
python -m sejm2git.cli build DU/2018/305 --repo corpus
git -C corpus log --graph --oneline
# ...add --dead-branches to also branch every failed bill (rejected/withdrawn/lapsed)
# and every live bill currently in the Sejm that targets this law
python -m sejm2git.cli build DU/2018/305 --repo corpus --dead-branches
git -C corpus log --graph --oneline --all
Fetched acts are cached under data/cache/; add --offline to run from the cache only.
Further reading
- docs/DECISIONS.md: every design choice, its rationale, and the alternatives, including the ones the data forced.
- docs/lawmaking-vs-git.md: an essay on the structural points where human lawmaking refuses to fit git's model (effective dates vs causal order, one act with several merge dates, retroactivity vs the immutable past, amendments as programs, and amendments of amendments: patches that patch pending patches, grammar as load-bearing syntax, a patch language with no checksum).
- docs/voting-in-git.md: how votes are represented in git (MPs as stable git identities, the queryable "who voted for what" index, and the "democratic merge"), and why Co-Authored-By can credit support but not dissent.
- docs/validation-report.md: the cross-check against the official tekst jednolity, covering the method, the 97.2% result, the full discrepancy taxonomy, the bugs it surfaced, and what it does and does not establish.
How it works
| module | role |
|---|---|
| client.py | ELI + Sejm API client (act metadata, HTML text, references, legislative process) with an on-disk cache |
| model.py | addressable document tree (Unit): legal lookup/insert/remove by (kind, num); articles are global but nest under chapters, so structural units are transparent during lookup |
| parser.py | ELI act HTML → Act tree; handles chapter headings, enumeration common-parts (część wspólna), tiret lists, annex parts (załączniki: text, flattened tables, or PDF-stub with provenance), keeps quoted amendment payloads on the tree (Unit.cites), and strips the footnote apparatus while preserving real superscripts |
| markdown.py | Act tree → diff-friendly Markdown (one editorial unit per line) |
| amend.py | extracts a small instruction language (otrzymuje brzmienie, dodaje się, uchyla się, skreśla się wyrazy, zastępuje się wyrazami, nested w art. 7: contexts, tiret edits, range/list targets, annex replacement/edits) from a parsed act tree and applies it to a base tree; each instruction is tagged with its source path within the amending act |
| consolidate.py | replays a law's whole amendment chain in chronological order to build the in-force text, first applying second-order amendments (acts that rewrite an amending act's instructions before they enter into force) to the amending act itself |
| entryforce.py | parses the final wchodzi w życie article into concrete dates: offsets, fixed dates, pierwszego dnia miesiąca następującego…, and staggered z wyjątkiem <provisions> tranches |
| repo.py | git materialization: branch per amendment, --no-ff merge on enactment, split into one merge per effective-date tranche |
| cli.py | parse / consolidate / build |
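To make the model.py and amend.py roles concrete, here is a minimal sketch of an addressable tree whose lookup passes through structural units, plus two instruction forms applied to it. The names `find`, `resolve`, `replace_wording`, `repeal`, the `STRUCTURAL` set, and the kind labels are illustrative assumptions, not the project's actual API; the chapter placement in the example is made up.

```python
from dataclasses import dataclass, field
from typing import Optional

STRUCTURAL = {"chapter"}  # assumed: kinds that lookup passes straight through


@dataclass
class Unit:
    kind: str                       # e.g. "act", "chapter", "art", "ust"
    num: str = ""                   # e.g. "7", "1a"
    text: str = ""
    children: list["Unit"] = field(default_factory=list)

    def find(self, kind: str, num: str) -> Optional["Unit"]:
        """Find a child by (kind, num), descending through structural units."""
        for child in self.children:
            if (child.kind, child.num) == (kind, num):
                return child
            if child.kind in STRUCTURAL:
                hit = child.find(kind, num)
                if hit is not None:
                    return hit
        return None

    def resolve(self, path: list[tuple[str, str]]) -> Optional["Unit"]:
        """Walk a legal address such as [("art", "7"), ("ust", "1a")]."""
        node = self
        for kind, num in path:
            node = node.find(kind, num)
            if node is None:
                return None
        return node


def replace_wording(root: Unit, path: list[tuple[str, str]], new_text: str) -> bool:
    """'otrzymuje brzmienie': swap the target's wording; False means a miss."""
    target = root.resolve(path)
    if target is None:
        return False  # the caller should record the miss, never drop it silently
    target.text, target.children = new_text, []
    return True


def repeal(root: Unit, path: list[tuple[str, str]]) -> bool:
    """'uchyla się': keep the unit's number, replace its content with a marker."""
    return replace_wording(root, path, "(uchylony)")


act = Unit("act", children=[
    Unit("chapter", "2", children=[          # hypothetical chapter placement
        Unit("art", "7", children=[
            Unit("ust", "1", "Zakaz, o którym mowa w art. 5, nie obowiązuje w:"),
            Unit("ust", "1a", "Jeżeli niedziela poprzedzająca ..."),
        ]),
    ]),
])
applied = repeal(act, [("art", "7"), ("ust", "1a")])   # art. 7 found through the chapter
missed = replace_wording(act, [("art", "99")], "...")  # no such article
```

Returning a boolean instead of raising keeps a replay going across a long chain while still letting the caller report every unapplied instruction.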
Why consolidation is reconstructed, not downloaded
ELI serves original enacted acts and amending acts as clean, structured HTML — but official consolidated texts (tekst jednolity) only as PDF. So we rebuild the consolidated text ourselves by replaying amendments. Each amending act names its target by an exact ELI; consolidations of one law share an ELI lineage (the act plus all its tekst jednolity positions), which is how we know which instructions belong to a law.
Data model insight (the bridge to "under voting → branch")
An act's metadata links it to its bill: prints[].linkProcessAPI → the Sejm
/processes/{nr} endpoint, whose stages[] track the bill from arrival through readings,
committees and votings to passed: true and the assigned ELI. That timeline is the spine
of the branch/merge model:
- a bill in progress → an open branch (commits per stage);
- passed + signed + published → merge to main.
The current build command demonstrates this on historical (already-enacted) acts by
replaying their chain as branch→merge cycles. Driving it fully live off the process API for
in-flight bills is the remaining automation step (see Roadmap).
Dead branches (and live ones) — --dead-branches
Bills that tried to amend a law but never became law are found in the process API by title
match (they carry no ELI), classified by their stages, and turned into unmerged branches
that fork from main at the law's state on the bill's introduction date:
- odrzucone/…: rejected in a vote (decision: "odrzucony");
- wycofane/…: withdrawn by the sponsor;
- niezakonczone/…: lapsed when the term ended (zasada dyskontynuacji), but only if the term has actually ended;
- w-toku/…: a bill still moving in the current term, which is an open branch (a live PR), not a dead one. passed: false alone never means "failed" in the sitting term.
The would-be text diff of a failed bill is not reconstructed: bill texts (druki) are PDF attachments, the same PDF wall as the consolidated texts.
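The classification rules above can be sketched as one function. This is an illustration only: the record shape (a `passed` flag and a `stages` list whose entries may carry a `decision`) follows the description in this README, but the `"wycofany"` marker for withdrawal and the `term_end` parameter are assumptions, not confirmed API fields.

```python
from datetime import date
from typing import Optional


def branch_prefix(process: dict, term_end: Optional[date], today: date) -> str:
    """Map a Sejm process record to a branch namespace.

    term_end is the end date of the bill's parliamentary term, or None
    while that term is still sitting (hypothetical parameter).
    """
    if process.get("passed"):
        return "nowelizacja"
    decisions = {stage.get("decision") for stage in process.get("stages", [])}
    if "odrzucony" in decisions:
        return "odrzucone"
    if "wycofany" in decisions:      # assumed marker for a withdrawn bill
        return "wycofane"
    if term_end is not None and term_end < today:
        return "niezakonczone"       # lapsed: the term really has ended
    return "w-toku"                  # passed=False in a sitting term is not failure


today = date(2025, 6, 1)
rejected = branch_prefix({"passed": False, "stages": [{"decision": "odrzucony"}]}, None, today)
lapsed = branch_prefix({"passed": False, "stages": []}, date(2023, 11, 12), today)
live = branch_prefix({"passed": False, "stages": []}, None, today)
```

Checking the term end last is what keeps a slow-moving live bill from being mislabelled as dead.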
The legislative journey: one commit per stage
Every branch — enacted, dead, or live — replays the bill's stages[] as one commit per
stage, dated at the stage date, with voting tallies in the commit body:
III czytanie na posiedzeniu Sejmu — uchwalono
Głosowanie #167: za 403, przeciw 10, wstrzym. 12 (głosowało 425)
So git log nowelizacja/DU-2024-1965 reads as a real PR timeline: wpłynął → I/II/III czytanie
→ Senat → Prezydent podpisał → Uchwalono → (the enacted text change) → merged to main. A
live bill's branch shows the same arc up to where it currently sits, unmerged.
With --rollcall, each vote commit also gets a per-club summary and the full named list, and
the enactment merge carries the passage roll-call as machine-readable voter trailers —
the "democratic merge". Each MP is a stable git identity, so the corpus becomes a two-way
voting index:
# who voted against a given enactment:
git show -s --format='%(trailers:key=Vote-against,valueonly)' <merge>
# every law a given MP voted FOR, across all terms:
git log --all --grep 'Vote-for: .*<tusk.donald@poslowie.sejm.gov.pl>'
See docs/voting-in-git.md for the model and its limits.
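The same index can be built offline from commit messages. The sketch below assumes trailers of the form `Vote-for: Name <email>` and `Vote-against: Name <email>`, as the queries above suggest; the function name, the merge ids, and the MPs in the example are invented for illustration.

```python
import re
from collections import defaultdict

# One voter trailer per line: key, display name, e-mail identity.
TRAILER = re.compile(r"^(Vote-[a-z]+):\s*(.+?)\s*<([^>]+)>$")


def voting_index(merges: dict[str, str]) -> dict:
    """Return email -> trailer key -> list of merge ids, in input order."""
    index = defaultdict(lambda: defaultdict(list))
    for merge_id, message in merges.items():
        for line in message.splitlines():
            match = TRAILER.match(line.strip())
            if match:
                key, _name, email = match.groups()
                index[email][key].append(merge_id)
    return index


# Fictional merge messages with fictional voters.
merges = {
    "merge-1": (
        "Wejście w życie: ...\n\n"
        "Vote-for: Jan Kowalski <kowalski.jan@example.org>\n"
        "Vote-against: Anna Nowak <nowak.anna@example.org>\n"
    ),
    "merge-2": (
        "Wejście w życie: ...\n\n"
        "Vote-for: Jan Kowalski <kowalski.jan@example.org>\n"
        "Vote-for: Anna Nowak <nowak.anna@example.org>\n"
    ),
}
index = voting_index(merges)
kowalski_for = index["kowalski.jan@example.org"]["Vote-for"]
```

Keying on the e-mail rather than the display name is what makes the identity stable across terms and spelling variants.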
Scheduled sync (the automation)
build is deterministic: same API state + same dates ⇒ identical commit hashes. So a
re-run is a clean sync, not churn — the repo changes only when reality does. The packaged
service runs the whole loop:
# one sync = fidelity gate -> rebuild every tracked law -> push on change
scripts/sejm2git-sync
# as a daily systemd user timer:
cp deploy/sejm2git-sync.{service,timer} ~/.config/systemd/user/
systemctl --user daemon-reload && systemctl --user enable --now sejm2git-sync.timer
Tracked laws live in corpus.laws (one ELI per line). The sync first runs
dashboard --fail-under as a regression gate — if mean substantive recall against the
official consolidated texts drops below the threshold, nothing is rebuilt or published.
Then each law is rebuilt (--clean --refresh: wipe the fully-derived repo, re-fetch the
mutable Sejm process data while immutable ELI acts stay cached) and the corpus is pushed
with --force — honest semantics for a repo that is a pure function of the API: when
reality is unchanged the hashes are identical and the push is a no-op; when upstream data
was corrected retroactively, the derived history is allowed to be rewritten. Configuration
is done through environment variables (SEJM2GIT_PUSH_URL, SEJM2GIT_FAIL_UNDER=0 to skip the
gate, …; see the script header). The configuration must live outside the corpus repo,
because --clean wipes that repo on each run.
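The gate logic itself is small. A minimal sketch, assuming per-law recall percentages are already computed: the function name and the input shape are hypothetical, and treating an empty result set as a failure is a choice made for this example, not documented behaviour.

```python
import sys
from statistics import mean


def gate(recalls: dict[str, float], fail_under: float) -> int:
    """Return a process exit code: 0 lets the sync proceed, 1 blocks it."""
    if fail_under <= 0:
        return 0                      # a threshold of 0 skips the gate
    if not recalls:
        return 1                      # assumption: no data is not a pass
    average = mean(recalls.values())
    if average < fail_under:
        print(f"mean recall {average:.1f}% < {fail_under}%", file=sys.stderr)
        return 1
    return 0


# Made-up recall figures for two placeholder laws.
ok = gate({"law-a": 99.0, "law-b": 96.0}, 97)       # mean 97.5
blocked = gate({"law-a": 99.0, "law-b": 94.0}, 97)  # mean 96.5
skipped = gate({"law-a": 10.0}, 0)
```

Because the gate averages across the corpus, a single divergent law can pass unnoticed; a per-law floor would be a natural extension.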
Add --rollcall (via SEJM2GIT_BUILD_FLAGS="--dead-branches --rollcall") to also embed
per-MP voting records (per-club summary + named lists on vote commits, voter trailers on
the merge). It is opt-in because it adds ~460 names per vote and one cached API call per
vote.
Across runs a bill moves on its own: w-toku/… → a nowelizacja/… merge once enacted, or
→ odrzucone/… once voted down — because the underlying passed/stages/ELI changed,
not because of any local bookkeeping.
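The determinism claim rests on how git names commits: in a SHA-1 repository the id is a hash of the tree, the parents, the author and committer lines (identity plus timestamp), and the message. The sketch below recomputes that id by hand. It assumes, as one plausible way to get reproducible history, that author and committer share one identity and one API-derived timestamp; the identity string and the message are placeholders.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's well-known empty tree


def commit_id(tree: str, parents: list[str], ident: str, timestamp: int, message: str) -> str:
    """Compute the SHA-1 id git would assign to this commit object."""
    stamp = f"{ident} {timestamp} +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {stamp}", f"committer {stamp}", "", message]
    body = "\n".join(lines).encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


ident = "sejm2git <bot@example.org>"   # placeholder identity
first = commit_id(EMPTY_TREE, [], ident, 1_700_000_000, "Initialize legal corpus\n")
again = commit_id(EMPTY_TREE, [], ident, 1_700_000_000, "Initialize legal corpus\n")
later = commit_id(EMPTY_TREE, [], ident, 1_700_000_001, "Initialize legal corpus\n")
```

Here `first` and `again` are equal while `later` differs, which is why wall-clock commit times would turn every re-run into churn.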
Entry into force (wchodzi w życie)
Merges are dated at the entry-into-force date, not publication, so git blame reflects
the law as actually in force. The authoritative date comes from the API; the parsed
schedule (entryforce.py) only overrides it for provisions named in an explicit
exception. Staggered acts ("…z wyjątkiem art. 2, który wchodzi w życie z dniem ogłoszenia")
are split into one dated merge per tranche. Vacatio legis (the publication→effect gap) is
naturally an approved-but-not-yet-merged PR; retroactive laws (z mocą od dnia…) get a
backdated commit and should be flagged.
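The override rule and the tranche split can be sketched as a grouping step. The function name, the input shape, and the `"*"` label for "all remaining provisions" are assumptions for illustration, and the dates are invented.

```python
from collections import defaultdict
from datetime import date


def merge_tranches(api_date: date, exceptions: dict[str, date]) -> list[tuple[date, list[str]]]:
    """Group provisions by effective date, one dated merge per group.

    api_date covers everything not named in an explicit exception;
    exceptions maps a provision label to its own effective date.
    """
    groups: dict[date, list[str]] = defaultdict(list)
    groups[api_date].append("*")           # "*" = all remaining provisions
    for provision, effective in exceptions.items():
        groups[effective].append(provision)
    return sorted(groups.items())          # chronological merge order


# Hypothetical act: in force on 2024-02-01, except art. 2, which takes
# effect on the (made-up) publication date.
plan = merge_tranches(date(2024, 2, 1), {"art. 2": date(2024, 1, 15)})
single = merge_tranches(date(2024, 2, 1), {})
```

An act with no exceptions collapses to a single merge dated by the API, so the parsed schedule never changes the common case.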
Validation against the official text
sejm2git validate <eli> (in validate.py) cross-checks the reconstruction against the
authoritative consolidated PDF. Because consolidations are PDF-only, it extracts the body text
(font-size filtered to drop footnotes and superscript markers), normalizes editorial-only
differences (refreshed Dz.U. citations, line-break hyphenation, whitespace), and reports the
character-level recall of the reconstruction within the official text. For the shop-trading
act this is 97.2%, with the gap fully attributed: the official omits spent
consequential-amendment provisions (which the reconstruction keeps) and refreshes
cross-reference citations — neither is a normative discrepancy. (Validation also surfaced and
fixed a real bug: the ELI HTML ships a malformed charset=UTF-8; charset=UTF-8 meta that some
lxml versions mis-sniff, so the parser now decodes UTF-8 explicitly.)
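The recall measure can be illustrated with a deliberately simplified sketch. It assumes the PDF body text has already been extracted, handles only hyphenation and whitespace (citation refreshing is left out), and counts a reconstructed unit as recalled when it occurs verbatim in the official text. Function names and sample texts are invented; this is not the code in validate.py.

```python
import re


def normalize(text: str) -> str:
    """Undo line-break hyphenation and collapse whitespace (crude on purpose)."""
    text = re.sub(r"-\s*\n\s*", "", text)   # also joins genuine end-of-line hyphens
    return re.sub(r"\s+", " ", text).strip()


def substantive_recall(units: list[str], official: str) -> float:
    """Percentage of reconstructed characters found verbatim in the official text."""
    haystack = normalize(official)
    total = found = 0
    for unit in units:
        needle = normalize(unit)
        if not needle:
            continue
        total += len(needle)
        if needle in haystack:
            found += len(needle)
    return 100.0 * found / total if total else 0.0


# Made-up texts: the official version hyphenates a word across a line break
# and omits a spent amending provision that the reconstruction keeps.
official = "Art. 1. Ustawa określa zasa-\ndy handlu.\nArt. 2. (uchylony)"
kept = "Art. 1. Ustawa określa zasady handlu."
spent = "Art. 3. Przepis zmieniający."
full = substantive_recall([kept], official)
partial = substantive_recall([kept, spent], official)
```

Measuring containment in one direction only is what lets omitted-by-design provisions show up as an explained gap rather than as noise in both texts.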
Known limitations / honest gaps
- Source quirk: ELI's HTML occasionally glues a single-letter preposition to the next word (e.g. "Wzakresie" for "W zakresie"). Faithfully preserved for now, not patched.
- Instruction coverage handles the common amendment forms; any unrecognized form is recorded (never silently dropped) and surfaced in the consolidation report and commit message, so laws needing attention are flagged rather than silently mis-consolidated.
- Scope per the design: statutes (ustawy) + their executive regulations. Regulations aren't voted on, so they commit straight to main rather than via the branch model.
Roadmap
- Extend the validated corpus to the 2017–2019 cohorts (~98 more statutes — pure compute, the engine and caches are ready).
- Reconstruct failed/in-flight bills' would-be diffs (needs a PDF parser for druki).
- Apply-time long tail (~220 unapplied of 3096, zero unrecognized forms): cascading misses on ~100-instruction omnibus chains, no-HTML amending acts (2025–2026 — a source gap that backfills over time), deep in-house-provision nesting.
Done along the way: annex amendments (D27); second-order amendments — acts that rewrite another amending act's instructions before they enter into force — plus the extraction-completeness overhaul they required (payload-bearing trees, tiret lists, range/list targets; D28); full-cohort scale with sentence-level ops, content designation, lead-in replaces, and the performance work that makes a 90-amendment chain validate in minutes (D29); the scheduled sync service — fidelity gate, rebuild, push-on-change as a daily systemd timer (D30).