Getting data of an election
On the 26th of May we had the European Parliament elections. In Romania, the results and the progress of the vote were published online, in real time, on the official electoral site.
As far as I know, it’s the first time we had such data exposed to the public, and with such granularity.
Since my daily work involves working closely with data, I couldn’t miss the opportunity to get my hands on that dataset. Of course, the site is not documented and there aren’t any publicly available APIs to begin with, so I spent some hours debugging the underlying stack to see how I could query it and compile the data into a usable format.
As far as I can tell, it’s built with React, with some NoSQL database as the backend. I’m betting on NoSQL because, while doing the ETL, I found some schema inconsistencies that shouldn’t normally happen if the data sat on top of a SQL DB.
Understanding the API
Maybe there’s a better way to do this, but what I did was open the browser’s developer console, refresh the election page, and look for a request that seemed to contain the data I was looking for. Using this approach, I found the following endpoints that I could query.
An example of how you query the BEC site for the presence data. You need to know the county (in this case AR, the code for ARAD).
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_AR.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' --compressed -o "_data/AR.json"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 73871 0 73871 0 0 319k 0 --:--:-- --:--:-- --:--:-- 319k
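If you’d rather stay in Python, the same request can be made with the requests library. This is a minimal sketch, mirroring the endpoint and headers of the curl call above:
Code
import requests

# Fetch the per-county results JSON, mirroring the curl call above.
url = "https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_AR.json"
headers = {
    "accept": "*/*",
    "referer": "https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part",
}
response = requests.get(url, headers=headers)
response.raise_for_status()
with open("_data/AR.json", "wb") as f:
    f.write(response.content)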
There is also an endpoint publishing the presence count (the number of people that have voted so far). Again, we need to query it for each county.
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/json/presence_AR_now.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-precincts' -H 'authority: prezenta.bec.ro' --compressed -o "_data/AR-presence.json"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 60786 0 60786 0 0 690k 0 --:--:-- --:--:-- --:--:-- 690k
There is also another CSV that we can use, which contains the presence data in a single big file.
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/csv/presence_now.csv' -H 'Referer: https://prezenta.bec.ro/europarlamentare26052019/abroad-pv-part' --compressed -o "_data/all_presence.csv"
Fetching the data
Now let’s get all the data out of the site. Each county has a dedicated page which contains information about its stats. By looking over the source of the site, we can compile a list of all the counties that we need to inspect. It’s interesting that S1..S6 (Bucharest’s sectors) were modeled as counties.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "B", "SR", "S1", "S2", "S3", "S4", "S5", "S6"]
The vote information is stored on the data/pv/json/ route, specific to each county. In order not to make multiple queries while testing, we first cache all the results locally and refer to them later on.
Above, we’ve deduced the counties we can query, but I’ve found that there are slight asymmetries for certain cases (mostly regarding the expat data and the way Bucharest is represented). Because of this, we need to handle the counties list in a case-by-case fashion.
Code
import json
from tqdm import tqdm_notebook as tqdm
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "S1", "S2", "S3", "S4", "S5", "S6"]
for county in tqdm(counties):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_{county}.json' -H 'accept-encoding: gzip, deflate, br' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' --compressed -o "_data/{county}.json"
The presence data is (as above) stored on a different route (data/presence/json/), specific to each county. Again, we fetch everything and cache it locally.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "B", "SR"]
for county in tqdm(counties[-8:-6]):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/json/presence_{county}_now.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-precincts' -H 'authority: prezenta.bec.ro' --compressed -o "_data/{county}-presence.json"
Let’s also get the all-in-one data about the presence.
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/csv/presence_now.csv' -H 'Referer: https://prezenta.bec.ro/europarlamentare26052019/abroad-pv-part' --compressed -o "_data/all_presence.csv"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2618k 0 2618k 0 0 5729k 0 --:--:-- --:--:-- --:--:-- 5729k
Compiling the data
Loading a presence file
When reading a presence file, there’s some manipulation that we need to do, because the returned JSON contains lots of information that seemed either useless, redundant (info we already had in other places), or that I didn’t know how to interpret.
There was also the age_ranges field, which actually contained a list of values that I needed to expand into individual columns by using a transform function.
Code
import pandas as pd

def read_json_file(file_name):
    # Small helper for loading a raw JSON file (the original notebook defines this elsewhere).
    with open(file_name) as f:
        return json.load(f)

def process_row(row):
    # Expand the age_ranges dict of a row into a flat tuple of counts.
    return tuple(row.age_ranges.values())

def load_presence(presence_file_name):
    _json = read_json_file(presence_file_name)
    _df = pd.DataFrame.from_records(_json['precinct'])
    _df[["men_18_24", "men_25_34", "men_35_44", "men_45_64", "men_65+", "women_18_24", "women_25_34", "women_35_44", "women_45_64", "women_65+"]] = _df.apply(process_row, axis=1, result_type='expand')
    _df.drop(columns=['age_ranges'], inplace=True)
    _df.columns = [
        'liste_permanente',
        'lista_suplimentare',
        'total',
        'urna_mobila',
        'county_code',
        'county_name',
        'id_county',
        'id_locality',
        'id_precinct',
        'id_uat',
        'initial_count',
        'latitude',
        'locality_name',
        'longitude',
        'medium',
        'precinct_name',
        'precinct_nr',
        'presence',
        'siruta',
        'uat_code',
        'uat_name',
        'men_18_24',
        'men_25_34',
        'men_35_44',
        'men_45_64',
        'men_65+',
        'women_18_24',
        'women_25_34',
        'women_35_44',
        'women_45_64',
        'women_65+',
    ]
    return _df
tulcea = load_presence("_data/TL-presence.json")
tulcea.head()
liste_permanente | lista_suplimentare | total | urna_mobila | county_code | county_name | id_county | id_locality | id_precinct | id_uat | ... | men_18_24 | men_25_34 | men_35_44 | men_45_64 | men_65+ | women_18_24 | women_25_34 | women_35_44 | women_45_64 | women_65+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 409 | 31 | 440 | 0 | TL | TULCEA | 38 | 8884 | 15848 | 2882 | ... | 13 | 23 | 36 | 84 | 29 | 17 | 22 | 55 | 110 | 51 |
1 | 471 | 73 | 544 | 0 | TL | TULCEA | 38 | 8884 | 15849 | 2882 | ... | 11 | 28 | 55 | 90 | 71 | 10 | 40 | 62 | 92 | 85 |
2 | 14 | 20 | 34 | 0 | TL | TULCEA | 38 | 8909 | 15943 | 2892 | ... | 1 | 1 | 5 | 10 | 4 | 0 | 2 | 4 | 5 | 2 |
3 | 134 | 35 | 169 | 0 | TL | TULCEA | 38 | 8914 | 15948 | 2894 | ... | 5 | 6 | 17 | 37 | 21 | 4 | 9 | 7 | 37 | 26 |
4 | 638 | 70 | 708 | 0 | TL | TULCEA | 38 | 8912 | 15946 | 2894 | ... | 19 | 42 | 46 | 185 | 78 | 17 | 42 | 53 | 142 | 84 |
5 rows × 31 columns
Getting all the individual results files for each precinct_nr
The voting results (who voted for what, and where) are stored on the data/pv route. The problem is that if we only use the per-county file downloaded above, there aren’t any keys that link the results to the presence rows. To be more exact, the results are listed per voting facility, whereas the presence is aggregated at the locality level. This means that we can’t really join the two.
Fortunately, I’ve found that if you know the id of a specific voting facility, you can ask for the results of that specific facility through a version of the original results API.
So the final strategy that worked was something along the following lines:
- for each county, load the presence file associated with it
- inspect all the precinct_nr values it contains
- individually query the API for the results of each precinct_nr.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "SR"]
for county in tqdm(counties):
df_county = load_presence(f"_data/{county}-presence.json")
for precinct_nr in tqdm(df_county['precinct_nr'].values, leave=False):
file_name = f"_data/{county}_results_{precinct_nr}.json"
if not os.path.exists(file_name):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
else:
with open(file_name) as f:
file_contents = f.read()
if "Cod birou electoral" not in file_contents:
print(f"File: {file_name} has bad content {file_contents[:50]}. Will retry")
os.remove(file_name)
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
Bucharest is a special case. It’s treated as a county, but the results are stored per sector, so we need to do things a bit differently.
Code
county = "B"
df_county = load_presence(f"_data/{county}-presence.json")
for id_sector in tqdm(df_county.id_locality.unique()):
sector = f"S{int(id_sector) - 9910 + 1}"
print(f"Processing: {sector}")
county = sector
for precinct_nr in tqdm(df_county[df_county.id_locality == id_sector]['precinct_nr'].values, leave=False):
file_name = f"_data/{county}_results_{precinct_nr}.json"
if not os.path.exists(file_name):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
else:
with open(file_name) as f:
file_contents = f.read()
if "Cod birou electoral" not in file_contents:
print(f"File: {file_name} has bad content {file_contents[:50]}. Will retry")
os.remove(file_name)
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
In SR we have the data about the foreign (abroad) voting offices.
Code
load_presence("_data/SR-presence.json").head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
liste_permanente | 0 | 0 | 0 | 0 | 0 |
lista_suplimentare | 18 | 57 | 865 | 79 | 1330 |
total | 18 | 57 | 865 | 79 | 1330 |
urna_mobila | 0 | 0 | 0 | 0 | 0 |
county_code | SR | SR | SR | SR | SR |
county_name | STRAINATATE | STRAINATATE | STRAINATATE | STRAINATATE | STRAINATATE |
id_county | 43 | 43 | 43 | 43 | 43 |
id_locality | 10244 | 10178 | 10334 | 10206 | 9996 |
id_precinct | 18619 | 18627 | 19096 | 18716 | 18723 |
id_uat | 3230 | 3218 | 3245 | 3219 | 3186 |
initial_count | 0 | 0 | 0 | 0 | 0 |
latitude | None | None | None | None | None |
locality_name | KABUL | BUENOS AIRES | RENNES | TBILISI | STUTTGART |
longitude | None | None | None | None | None |
medium | U | U | U | U | U |
precinct_name | Eticheta Credentiale/Tableta - 1 | Eticheta Credentiale/Tableta - 10 | Eticheta Credentiale/Tableta - 100 | Eticheta Credentiale/Tableta - 101 | Eticheta Credentiale/Tableta - 102 |
precinct_nr | 1 | 10 | 100 | 101 | 102 |
presence | 0 | 0 | 0 | 0 | 0 |
siruta | None | None | None | None | None |
uat_code | AF | AR | FR | GE | DE |
uat_name | AFGANISTAN | ARGENTINA | FRANȚA | GEORGIA | GERMANIA |
men_18_24 | 0 | 1 | 50 | 1 | 71 |
men_25_34 | 4 | 3 | 155 | 9 | 258 |
men_35_44 | 11 | 4 | 172 | 18 | 270 |
men_45_64 | 3 | 7 | 141 | 14 | 249 |
men_65+ | 0 | 1 | 8 | 1 | 4 |
women_18_24 | 0 | 0 | 37 | 1 | 50 |
women_25_34 | 0 | 10 | 119 | 13 | 182 |
women_35_44 | 0 | 13 | 93 | 11 | 128 |
women_45_64 | 0 | 11 | 76 | 9 | 112 |
women_65+ | 0 | 7 | 14 | 2 | 6 |
Reading all the presence data into a single DataFrame
We now have all the presence data cached, and we’ll read it into a single dataframe.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "SR", "B"]
df_precints = pd.concat((load_presence(f) for f in tqdm(glob("_data/*-presence.json"))), ignore_index=True)
df_precints.shape
(19171, 31)
Code
df_precints.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
liste_permanente | 696 | 140 | 501 | 571 | 680 |
lista_suplimentare | 63 | 10 | 25 | 41 | 55 |
total | 759 | 150 | 526 | 612 | 736 |
urna_mobila | 0 | 0 | 0 | 0 | 1 |
county_code | VS | VS | VS | VS | VS |
county_name | VASLUI | VASLUI | VASLUI | VASLUI | VASLUI |
id_county | 39 | 39 | 39 | 39 | 39 |
id_locality | 9015 | 9015 | 9006 | 9006 | 9006 |
id_precinct | 16128 | 16187 | 16086 | 16087 | 16088 |
id_uat | 2936 | 2936 | 2933 | 2933 | 2933 |
initial_count | 1470 | 1840 | 1354 | 1375 | 1570 |
latitude | 46.6389853639550 | 46.6421141774663 | 46.2240238056566 | 46.2278431009305 | 46.2278431009305 |
locality_name | VASLUI | VASLUI | BÂRLAD | BÂRLAD | BÂRLAD |
longitude | 27.7326775437114 | 27.7289502189002 | 27.6775710052581 | 27.6686353095150 | 27.6686353095150 |
medium | U | U | U | U | U |
precinct_name | CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂN... | ȘCOALA GIMNAZIALĂ ”CONSTANTIN PARFENE” | ŞCOALA GIMNAZIALĂ ”VICTOR IOAN POPA” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” |
precinct_nr | 1 | 10 | 100 | 101 | 102 |
presence | 51.6327 | 8.1522 | 38.8479 | 44.5091 | 46.879 |
siruta | 161954 | 161954 | 161801 | 161801 | 161801 |
uat_code | None | None | None | None | None |
uat_name | MUNICIPIUL VASLUI | MUNICIPIUL VASLUI | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD |
men_18_24 | 13 | 4 | 18 | 15 | 11 |
men_25_34 | 31 | 5 | 25 | 29 | 40 |
men_35_44 | 66 | 17 | 60 | 52 | 60 |
men_45_64 | 110 | 28 | 104 | 113 | 158 |
men_65+ | 115 | 20 | 43 | 72 | 81 |
women_18_24 | 22 | 6 | 17 | 22 | 14 |
women_25_34 | 34 | 8 | 33 | 32 | 46 |
women_35_44 | 78 | 15 | 64 | 55 | 52 |
women_45_64 | 171 | 28 | 117 | 127 | 178 |
women_65+ | 119 | 19 | 45 | 95 | 96 |
The all_presence.csv file contains information about age groups, more granular than the bucketed info found in the county files. We will merge it with the current dataframe.
Code
_all_df = pd.read_csv("_data/all_presence.csv")
_all_df.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Judet | AB | AB | AB | AB | AB |
UAT | MUNICIPIUL ALBA IULIA | MUNICIPIUL ALBA IULIA | MUNICIPIUL SEBEŞ | MUNICIPIUL SEBEŞ | MUNICIPIUL SEBEŞ |
Localitate | ALBA IULIA | ALBA IULIA | SEBEŞ | SEBEŞ | SEBEŞ |
Siruta | 1026 | 1026 | 1883 | 1883 | 1883 |
Nr sectie de votare | 1 | 10 | 100 | 101 | 102 |
Nume sectie de votare | CENTRUL DE ZI PENTRU PERSOANE VÂRSTNICE | COLEGIUL NAŢIONAL „HOREA CLOŞCA ŞI CRIŞAN” | ŞCOALA GIMNAZIALĂ NR. 2 SEBEŞ | COLEGIUL NAŢIONAL ”LUCIAN BLAGA” SEBEŞ | COLEGIUL NAŢIONAL ”LUCIAN BLAGA” SEBEŞ |
Mediu | U | U | U | U | U |
Votanti lista | 1612 | 1443 | 1415 | 1303 | 1362 |
LP | 901 | 648 | 769 | 697 | 765 |
LS | 45 | 143 | 66 | 73 | 27 |
UM | 0 | 0 | 14 | 42 | 0 |
LT | 946 | 791 | 849 | 812 | 792 |
Barbati 18-24 | 26 | 38 | 34 | 24 | 31 |
Barbati 25-34 | 58 | 68 | 56 | 56 | 35 |
Barbati 35-44 | 88 | 69 | 84 | 70 | 82 |
Barbati 45-64 | 165 | 128 | 157 | 136 | 136 |
Barbati 65+ | 102 | 71 | 88 | 78 | 58 |
Femei 18-24 | 32 | 40 | 30 | 30 | 35 |
Femei 25-34 | 72 | 56 | 61 | 61 | 48 |
Femei 35-44 | 107 | 79 | 76 | 92 | 122 |
Femei 45-64 | 178 | 161 | 163 | 167 | 160 |
Femei 65+ | 118 | 81 | 100 | 98 | 85 |
Barbati 18 | 7 | 9 | 5 | 4 | 3 |
Barbati 19 | 4 | 4 | 4 | 6 | 5 |
Barbati 20 | 5 | 7 | 6 | 2 | 9 |
Barbati 21 | 1 | 2 | 2 | 2 | 5 |
Barbati 22 | 2 | 7 | 7 | 1 | 7 |
Barbati 23 | 5 | 4 | 4 | 4 | 0 |
Barbati 24 | 2 | 5 | 6 | 5 | 2 |
Barbati 25 | 5 | 9 | 2 | 2 | 1 |
... | ... | ... | ... | ... | ... |
Femei 91 | 0 | 0 | 0 | 0 | 0 |
Femei 92 | 1 | 0 | 0 | 0 | 0 |
Femei 93 | 0 | 0 | 0 | 0 | 0 |
Femei 94 | 0 | 0 | 0 | 0 | 0 |
Femei 95 | 0 | 0 | 0 | 0 | 0 |
Femei 96 | 0 | 0 | 0 | 0 | 0 |
Femei 97 | 0 | 0 | 0 | 0 | 0 |
Femei 98 | 0 | 0 | 0 | 0 | 0 |
Femei 99 | 0 | 0 | 0 | 0 | 0 |
Femei 100 | 0 | 0 | 0 | 0 | 0 |
Femei 101 | 0 | 0 | 0 | 0 | 0 |
Femei 102 | 0 | 0 | 0 | 0 | 0 |
Femei 103 | 0 | 0 | 0 | 0 | 0 |
Femei 104 | 0 | 0 | 0 | 0 | 0 |
Femei 105 | 0 | 0 | 0 | 0 | 0 |
Femei 106 | 0 | 0 | 0 | 0 | 0 |
Femei 107 | 0 | 0 | 0 | 0 | 0 |
Femei 108 | 0 | 0 | 0 | 0 | 0 |
Femei 109 | 0 | 0 | 0 | 0 | 0 |
Femei 110 | 0 | 0 | 0 | 0 | 0 |
Femei 111 | 0 | 0 | 0 | 0 | 0 |
Femei 112 | 0 | 0 | 0 | 0 | 0 |
Femei 113 | 0 | 0 | 0 | 0 | 0 |
Femei 114 | 0 | 0 | 0 | 0 | 0 |
Femei 115 | 0 | 0 | 0 | 0 | 0 |
Femei 116 | 0 | 0 | 0 | 0 | 0 |
Femei 117 | 0 | 0 | 0 | 0 | 0 |
Femei 118 | 0 | 0 | 0 | 0 | 0 |
Femei 119 | 0 | 0 | 0 | 0 | 0 |
Femei 120 | 0 | 0 | 0 | 0 | 0 |
228 rows × 5 columns
It’s interesting that this file contains presence data with one column per year of age (which is more granular than the 10-year buckets we had prior).
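As a quick sanity check (a sketch, assuming the column names shown above), the per-year columns should add up to the corresponding 10-year buckets:
Code
# The per-year male columns for ages 18..24 should sum to the 18-24 bucket.
per_year = [f"Barbati {age}" for age in range(18, 25)]
(_all_df[per_year].sum(axis=1) == _all_df["Barbati 18-24"]).all()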
Reading all the results data into a single DataFrame
We will load the individual results files, fetched a few steps above, into a single big pandas DataFrame.
Code
df_results = pd.concat((pd.read_csv(f) for f in tqdm(glob("_data/*_results_*"))), ignore_index=True)
df_results.shape
(19171, 36)
Code
df_results.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Cod birou electoral | 22 | 5 | 35 | 35 | 6 |
Județ | HUNEDOARA | BIHOR | SUCEAVA | SUCEAVA | BISTRIŢA-NĂSĂUD |
Uat | ORAŞ CĂLAN | CEICA | VICOVU DE JOS | MUNICIPIUL SUCEAVA | BUDEŞTI |
Localitate | CĂLAN | BUCIUM | VICOVU DE JOS | SUCEAVA | BUDEŞTI-FÂNAŢE |
Secție | ŞCOALA GIMNAZIALĂ | ȘCOALA BUCIUM | SCOALA CU CLASELE I-VIII IOAN VICOVEANU | GRĂDINIŢA CU PROGRAM NORMAL NR.7 | ŞCOALA PRIMARĂ BUDEŞTI-FÎNAŢE |
Nr | 190 | 320 | 532 | 61 | 97 |
Tip | Europarlamentare | Europarlamentare | Europarlamentare | Europarlamentare | Europarlamentare |
a | 1471 | 172 | 1344 | 1393 | 256 |
a1 | 1471 | 172 | 1344 | 1393 | 256 |
a2 | 0 | 0 | 0 | 0 | 0 |
b | 534 | 116 | 520 | 625 | 162 |
b1 | 505 | 88 | 479 | 560 | 141 |
b2 | 0 | 0 | 0 | 0 | 0 |
b3 | 29 | 28 | 41 | 65 | 21 |
c | 1600 | 188 | 1500 | 1500 | 300 |
d | 1066 | 72 | 980 | 875 | 138 |
e | 514 | 113 | 504 | 605 | 153 |
f | 20 | 3 | 16 | 20 | 9 |
h | 0 | 0 | NU ESTE CAZUL | NU ESTE CAZUL | 0 |
i | 0 | 0 | FOARTE BUNA | FOARTE BUNA | 0 |
g1 | 111 | 46 | 174 | 128 | 18 |
g2 | 86 | 14 | 51 | 126 | 29 |
g3 | 29 | 5 | 34 | 43 | 7 |
g4 | 15 | 0 | 0 | 2 | 1 |
g5 | 176 | 28 | 162 | 153 | 62 |
g6 | 18 | 4 | 19 | 27 | 10 |
g7 | 2 | 3 | 2 | 6 | 1 |
g8 | 25 | 5 | 29 | 55 | 11 |
g9 | 6 | 0 | 2 | 3 | 3 |
g10 | 3 | 0 | 3 | 1 | 2 |
g11 | 3 | 0 | 3 | 2 | 2 |
g12 | 1 | 0 | 2 | 3 | 2 |
g13 | 2 | 0 | 0 | 1 | 0 |
g14 | 18 | 0 | 4 | 21 | 3 |
g15 | 9 | 2 | 9 | 12 | 1 |
g16 | 10 | 6 | 10 | 22 | 1 |
Joining the results with the presence data
Some code cleanup is needed first. In order to join the two dataframes, we need to make slight conversions so that the keys from both sides match.
Code
df_results.loc[df_results.Județ == "STRĂINĂTATE", "Județ"] = "STRAINATATE"
df_results.loc[df_results.Uat == "OMAN", "Uat"] = "SULTANATUL OMAN"
# All six Bucharest sectors map onto the same county name.
sectors = [f"SECTOR {i}" for i in range(1, 7)]
df_results.loc[df_results.Județ.isin(sectors), "Județ"] = "MUNICIPIUL BUCUREŞTI"
Now, if we merge the two we will get a single big dataframe with the same number of rows but double the columns.
Code
df_precint_with_results = pd.merge(left=df_precints, right=df_results, left_on=["county_name", "uat_name", "precinct_nr"], right_on=["Județ", "Uat", "Nr"])
df_precint_with_results.shape
(19171, 67)
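A cheap safety net worth keeping around (a sketch): assert that the merge was one-to-one, so no rows were lost or multiplied.
Code
# The merged dataframe should preserve the row count of the presence data.
assert df_precint_with_results.shape[0] == df_precints.shape[0]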
Let’s print an example of how one entry looks in practice.
Code
dict(df_precint_with_results.iloc[0])
{'liste_permanente': 696,
'lista_suplimentare': 63,
'total': 759,
'urna_mobila': 0,
'county_code': 'VS',
'county_name': 'VASLUI',
'id_county': '39',
'id_locality': '9015',
'id_precinct': '16128',
'id_uat': '2936',
'initial_count': 1470,
'latitude': '46.6389853639550',
'locality_name': 'VASLUI',
'longitude': '27.7326775437114',
'medium': 'U',
'precinct_name': 'CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂNASE”',
'precinct_nr': 1,
'presence': 51.6327,
'siruta': '161954',
'uat_code': None,
'uat_name': 'MUNICIPIUL VASLUI',
'men_18_24': 13,
'men_25_34': 31,
'men_35_44': 66,
'men_45_64': 110,
'men_65+': 115,
'women_18_24': 22,
'women_25_34': 34,
'women_35_44': 78,
'women_45_64': 171,
'women_65+': 119,
'Cod birou electoral': 39,
'Județ': 'VASLUI',
'Uat': 'MUNICIPIUL VASLUI',
'Localitate': 'VASLUI',
'Secție': 'CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂNASE”',
'Nr': 1,
'Tip': 'Europarlamentare',
'a': 1470,
'a1': 1470,
'a2': 0,
'b': 759,
'b1': 695,
'b2': 0,
'b3': 64,
'c': 1500,
'd': 741,
'e': 741,
'f': 18,
'h': 0,
'i': 0,
'g1': 185,
'g2': 232,
'g3': 51,
'g4': 0,
'g5': 118,
'g6': 37,
'g7': 2,
'g8': 68,
'g9': 0,
'g10': 4,
'g11': 5,
'g12': 3,
'g13': 2,
'g14': 18,
'g15': 9,
'g16': 7}
We will also join the data with the all_presence.csv file.
Code
df_full = pd.merge(left=df_precint_with_results, right=_all_df, left_on=["county_code", "uat_name", "precinct_nr"], right_on=["Judet", "UAT", "Nr sectie de votare"])
df_full.shape
(19171, 295)
Applying the legend
Some of the columns in the above dataframe are not quite obvious (g1, .., g16, etc.). These are party names that I was only able to find in a legend, inside the dropdown of a button in the UI of the site. I’ve copied it here, along with explanations of some fields that I’ve been able to figure out by looking over the PDFs of the official scanned documents.
We need to convert these column names into more meaningful labels.
Code
df_full.rename(columns={
"g1": "PSD",
"g2": "USR-PLUS",
"g3": "PRO Romania",
"g4": "UDMR",
"g5": "PNL",
"g6": "ALDE",
"g7": "PRODEMO",
"g8": "PMP",
"g9": "Partidul Socialist Roman",
"g10": "Partidul Social Democrat Independent",
"g11": "Partidul Romania Unita",
"g12": "Uniunea Nationala Pentur Progresul Romaniei",
"g13": "Blocul Unitatii Nationale",
"g14": "Gregoriana-Carmen Tudoran",
"g15": "George-Nicaolae Simion",
"g16": "Peter Costea",
"a": "Total alegatori",
"a1": "Total lista permanenta",
"a2": "Total urna mobila",
"b": "Total prezenti",
"b1": "Prezenti lista permanenta",
"b2": "Prezenti urna mobila",
"b3": "Prezenti lista suplimentara",
"c": "Total voturi",
"d": "Voturi nefolosite",
"e": "Voturi valabile",
"f": "Voturi anulate",
"h": "Contestatii",
"i": "Starea sigiliilor"
}, inplace=True)
OK, let’s check the amount of missing data:
Code
na_series = df_full.isna().sum()
na_series[na_series != 0]
latitude 479
longitude 479
siruta 441
uat_code 18730
Contestatii 19
Starea sigiliilor 8
Siruta 441
Removing duplicate columns
Because we’ve basically merged two types of presence datasets (the per-county one and the all_presence.csv one), we ended up with some duplicate columns in the joined dataframe. The columns we joined on are also duplicated, along with columns that contained the same type of information.
We want to eliminate those. We will find the duplicated columns by:
- using the pandas duplicated method (applied on the transposed matrix, since duplicated only works on rows)
- comparing the resulting candidate columns pairwise and keeping the pairs that are exactly equal.
Code
duplicate_columns = df_full.columns[df_full.T.duplicated(keep=False)]
duplicate_columns
Index(['liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila',
'county_code', 'county_name', 'initial_count', 'locality_name',
'medium', 'precinct_name', 'precinct_nr', 'uat_name', 'men_18_24',
'men_25_34', 'men_35_44', 'men_45_64', 'men_65+', 'women_18_24',
'women_25_34', 'women_35_44', 'women_45_64', 'women_65+', 'Județ',
'Uat', 'Nr', 'Judet', 'UAT', 'Localitate_y', 'Nr sectie de votare',
'Nume sectie de votare', 'Mediu', 'Votanti lista', 'LP', 'LS', 'UM',
'LT', 'Barbati 18-24', 'Barbati 25-34', 'Barbati 35-44',
'Barbati 45-64', 'Barbati 65+', 'Femei 18-24', 'Femei 25-34',
'Femei 35-44', 'Femei 45-64', 'Femei 65+', 'Barbati 104', 'Barbati 106',
'Barbati 108', 'Barbati 109', 'Barbati 110', 'Barbati 112',
'Barbati 113', 'Barbati 114', 'Barbati 115', 'Barbati 116',
'Barbati 117', 'Barbati 118', 'Barbati 119', 'Barbati 120', 'Femei 105',
'Femei 106', 'Femei 107', 'Femei 108', 'Femei 110', 'Femei 111',
'Femei 112', 'Femei 113', 'Femei 114', 'Femei 115', 'Femei 116',
'Femei 117', 'Femei 118', 'Femei 119', 'Femei 120'],
dtype='object')
With these, we will compare each column with every other and see which series are equal. This results in a long list of pairs of columns that are duplicates of one another.
Code
_pairs = set()
for i, _a in enumerate(duplicate_columns):
    for _b in duplicate_columns[i+1:]:
        if (df_full[_a] == df_full[_b]).all():
            _pairs.add(tuple(sorted([_a, _b])))
_pairs
{('Barbati 104', 'Barbati 106'),
('Barbati 104', 'Barbati 108'),
('Barbati 104', 'Barbati 109'),
...
('Barbati 120', 'Femei 120'),
('Barbati 18-24', 'men_18_24'),
('Barbati 25-34', 'men_25_34'),
('Barbati 35-44', 'men_35_44'),
('Barbati 45-64', 'men_45_64'),
('Barbati 65+', 'men_65+'),
('Femei 105', 'Femei 106'),
('Femei 105', 'Femei 107'),
...
('Femei 119', 'Femei 120'),
('Femei 18-24', 'women_18_24'),
('Femei 25-34', 'women_25_34'),
('Femei 35-44', 'women_35_44'),
('Femei 45-64', 'women_45_64'),
('Femei 65+', 'women_65+'),
('Judet', 'county_code'),
('Județ', 'county_name'),
('LP', 'liste_permanente'),
('LS', 'lista_suplimentare'),
('LT', 'total'),
('Localitate_y', 'locality_name'),
('Mediu', 'medium'),
('Nr', 'Nr sectie de votare'),
('Nr', 'precinct_nr'),
('Nr sectie de votare', 'precinct_nr'),
('Nume sectie de votare', 'precinct_name'),
('UAT', 'Uat'),
('UAT', 'uat_name'),
('UM', 'urna_mobila'),
('Uat', 'uat_name'),
('Votanti lista', 'initial_count')}
There’s only one more step that we need to do: find the groups of columns that hold the same information. There are cases where the columns are not only duplicated but triplicated, which results in (A == B), (B == C), (C == A) pairs in the analysis above.
This is the perfect job for the disjoint-set data structure.
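The DisjointSets class used below is not a standard library one; here’s a minimal union-find sketch that supports the union and sets calls we need:
Code
from collections import defaultdict

class DisjointSets:
    # Minimal disjoint-set (union-find), just enough for the usage below.
    def __init__(self):
        self._parent = {}

    def _find(self, x):
        # Find the root of x, compressing the path along the way.
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def union(self, a, b):
        # Merge the sets containing a and b.
        self._parent[self._find(a)] = self._find(b)

    def sets(self):
        # Return the disjoint sets as a list of lists.
        groups = defaultdict(list)
        for x in self._parent:
            groups[self._find(x)].append(x)
        return list(groups.values())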
Code
_groups = DisjointSets()
for _a, _b in _pairs:
    _groups.union(_a, _b)
_groups.sets()
[['Barbati 109',
'Femei 118',
'Femei 110',
'Femei 114',
'Barbati 104',
'Femei 115',
'Barbati 116',
'Barbati 117',
'Barbati 114',
'Femei 107',
'Femei 119',
'Femei 105',
'Barbati 115',
'Femei 108',
'Barbati 118',
'Barbati 108',
'Barbati 119',
'Femei 120',
'Femei 116',
'Barbati 112',
'Femei 113',
'Barbati 113',
'Barbati 120',
'Femei 117',
'Barbati 106',
'Barbati 110',
'Femei 106',
'Femei 112',
'Femei 111'],
['LP', 'liste_permanente'],
['Femei 35-44', 'women_35_44'],
['LT', 'total'],
['Nr sectie de votare', 'precinct_nr', 'Nr'],
['UM', 'urna_mobila'],
['Mediu', 'medium'],
['Barbati 65+', 'men_65+'],
['Barbati 35-44', 'men_35_44'],
['Femei 18-24', 'women_18_24'],
['Votanti lista', 'initial_count'],
['Femei 25-34', 'women_25_34'],
['Barbati 25-34', 'men_25_34'],
['UAT', 'Uat', 'uat_name'],
['Barbati 18-24', 'men_18_24'],
['Barbati 45-64', 'men_45_64'],
['Localitate_y', 'locality_name'],
['Femei 45-64', 'women_45_64'],
['Nume sectie de votare', 'precinct_name'],
['Judet', 'county_code'],
['Femei 65+', 'women_65+'],
['Județ', 'county_name'],
['LS', 'lista_suplimentare']]
From the list above, we choose to drop the following columns:
Code
drop_columns = ['Barbati 104', 'Barbati 106',
'Barbati 108', 'Barbati 109', 'Barbati 110', 'Barbati 112',
'Barbati 113', 'Barbati 114', 'Barbati 115', 'Barbati 116',
'Barbati 117', 'Barbati 118', 'Barbati 119', 'Barbati 120', 'Femei 105',
'Femei 106', 'Femei 107', 'Femei 108', 'Femei 110', 'Femei 111',
'Femei 112', 'Femei 113', 'Femei 114', 'Femei 115', 'Femei 116',
'Femei 117', 'Femei 118', 'Femei 119', 'Femei 120', 'LP', 'Femei 35-44', 'LT', 'Nr sectie de votare', 'Nr', 'UM',
'Mediu', 'Barbati 65+', 'Barbati 35-44','Femei 18-24', 'initial_count', 'Femei 25-34', 'Barbati 25-34',
'UAT', 'Uat', 'Barbati 18-24', 'Barbati 45-64', 'Localitate_y', 'Femei 45-64', 'Nume sectie de votare',
'Judet', 'Femei 65+', 'Județ', 'LS',
]
df_final = df_full.drop(columns=drop_columns)
df_final.columns
Index(['liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila',
'county_code', 'county_name', 'id_county', 'id_locality', 'id_precinct',
'id_uat',
...
'Femei 96', 'Femei 97', 'Femei 98', 'Femei 99', 'Femei 100',
'Femei 101', 'Femei 102', 'Femei 103', 'Femei 104', 'Femei 109'],
dtype='object', length=242)
And we end up with the following:
Code
df_final.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
liste_permanente | 696 | 140 | 501 | 571 | 680 |
lista_suplimentare | 63 | 10 | 25 | 41 | 55 |
total | 759 | 150 | 526 | 612 | 736 |
urna_mobila | 0 | 0 | 0 | 0 | 1 |
county_code | VS | VS | VS | VS | VS |
county_name | VASLUI | VASLUI | VASLUI | VASLUI | VASLUI |
id_county | 39 | 39 | 39 | 39 | 39 |
id_locality | 9015 | 9015 | 9006 | 9006 | 9006 |
id_precinct | 16128 | 16187 | 16086 | 16087 | 16088 |
id_uat | 2936 | 2936 | 2933 | 2933 | 2933 |
latitude | 46.6389853639550 | 46.6421141774663 | 46.2240238056566 | 46.2278431009305 | 46.2278431009305 |
locality_name | VASLUI | VASLUI | BÂRLAD | BÂRLAD | BÂRLAD |
longitude | 27.7326775437114 | 27.7289502189002 | 27.6775710052581 | 27.6686353095150 | 27.6686353095150 |
medium | U | U | U | U | U |
precinct_name | CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂN... | ȘCOALA GIMNAZIALĂ ”CONSTANTIN PARFENE” | ŞCOALA GIMNAZIALĂ ”VICTOR IOAN POPA” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” |
precinct_nr | 1 | 10 | 100 | 101 | 102 |
presence | 51.6327 | 8.1522 | 38.8479 | 44.5091 | 46.879 |
siruta | 161954 | 161954 | 161801 | 161801 | 161801 |
uat_code | None | None | None | None | None |
uat_name | MUNICIPIUL VASLUI | MUNICIPIUL VASLUI | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD |
men_18_24 | 13 | 4 | 18 | 15 | 11 |
men_25_34 | 31 | 5 | 25 | 29 | 40 |
men_35_44 | 66 | 17 | 60 | 52 | 60 |
men_45_64 | 110 | 28 | 104 | 113 | 158 |
men_65+ | 115 | 20 | 43 | 72 | 81 |
women_18_24 | 22 | 6 | 17 | 22 | 14 |
women_25_34 | 34 | 8 | 33 | 32 | 46 |
women_35_44 | 78 | 15 | 64 | 55 | 52 |
women_45_64 | 171 | 28 | 117 | 127 | 178 |
women_65+ | 119 | 19 | 45 | 95 | 96 |
... | ... | ... | ... | ... | ... |
Femei 76 | 5 | 2 | 0 | 7 | 2 |
Femei 77 | 4 | 0 | 2 | 0 | 0 |
Femei 78 | 7 | 0 | 2 | 2 | 3 |
Femei 79 | 4 | 1 | 1 | 2 | 3 |
Femei 80 | 5 | 0 | 3 | 1 | 1 |
Femei 81 | 4 | 0 | 0 | 3 | 0 |
Femei 82 | 2 | 0 | 0 | 4 | 3 |
Femei 83 | 1 | 0 | 1 | 2 | 1 |
Femei 84 | 0 | 1 | 0 | 3 | 1 |
Femei 85 | 1 | 0 | 0 | 1 | 0 |
Femei 86 | 2 | 1 | 0 | 1 | 0 |
Femei 87 | 1 | 1 | 0 | 0 | 2 |
Femei 88 | 2 | 1 | 0 | 0 | 0 |
Femei 89 | 0 | 0 | 0 | 0 | 1 |
Femei 90 | 0 | 0 | 0 | 0 | 0 |
Femei 91 | 0 | 0 | 0 | 0 | 0 |
Femei 92 | 0 | 0 | 0 | 0 | 0 |
Femei 93 | 0 | 0 | 0 | 0 | 0 |
Femei 94 | 0 | 0 | 0 | 0 | 0 |
Femei 95 | 0 | 0 | 0 | 0 | 0 |
Femei 96 | 0 | 0 | 0 | 0 | 0 |
Femei 97 | 0 | 0 | 0 | 0 | 0 |
Femei 98 | 0 | 0 | 0 | 0 | 0 |
Femei 99 | 0 | 0 | 0 | 0 | 0 |
Femei 100 | 0 | 0 | 0 | 0 | 0 |
Femei 101 | 0 | 0 | 0 | 0 | 0 |
Femei 102 | 0 | 0 | 0 | 0 | 0 |
Femei 103 | 0 | 0 | 0 | 0 | 0 |
Femei 104 | 0 | 0 | 0 | 0 | 0 |
Femei 109 | 0 | 0 | 0 | 0 | 0 |
242 rows × 5 columns
Saving the data to CSV format
We’re almost done. We only need to save the dataset to disk and start using it (to be continued in a future post)!
Code
df_final.to_csv("_data/final.csv")
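Later on, the dataset can be read back with (a sketch, assuming the same path):
Code
# Read the saved dataset back; index_col=0 restores the saved index.
df_final = pd.read_csv("_data/final.csv", index_col=0)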