Getting data of an election
On the 26th of May we had the European Parliament elections. In Romania, the results and the progress of the vote were published online, in real time, on the official electoral site.
As far as I know, it’s the first time we had such data exposed to the public, and with such granularity.
Since my daily work involves working closely with data, I couldn’t miss the opportunity to get my hands on that dataset. Of course, the site is not documented and there aren’t any publicly available APIs to begin with, so I spent some hours debugging the underlying stack to see how I could query it and compile the data into a usable format.
As far as I can tell, it’s built with React, with some NoSQL database as the backend. I’m betting on NoSQL because, while doing the ETL, I found some schema inconsistencies that shouldn’t normally happen if the data sat on top of a SQL DB.
Understanding the API
Maybe there’s a better way to do this, but what I did was open the browser’s developer console, refresh the election page, and look for a request that seemed to contain the data I was looking for. Using this approach, I found the following endpoints that I could query.
An example of how you query the BEC site for the presence data. You need to know the county (in this case AR, the code for ARAD).
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_AR.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' --compressed -o "_data/AR.json"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 73871 0 73871 0 0 319k 0 --:--:-- --:--:-- --:--:-- 319k
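If you’d rather stay in Python, the same request can be made with the requests library. This is a minimal sketch, mirroring the endpoint and headers of the curl call above:
Code
import requests

# Fetch the per-county results JSON, mirroring the curl call above.
url = "https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_AR.json"
headers = {
    "accept": "*/*",
    "referer": "https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part",
}
response = requests.get(url, headers=headers)
response.raise_for_status()
with open("_data/AR.json", "wb") as f:
    f.write(response.content)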
There is also an endpoint publishing the presence count (the number of people that have voted so far). Again, we need to query it for each county.
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/json/presence_AR_now.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-precincts' -H 'authority: prezenta.bec.ro' --compressed -o "_data/AR-presence.json"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 60786 0 60786 0 0 690k 0 --:--:-- --:--:-- --:--:-- 690k
There is also another CSV that we can use, which contains the presence data in a single big file.
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/csv/presence_now.csv' -H 'Referer: https://prezenta.bec.ro/europarlamentare26052019/abroad-pv-part' --compressed -o "_data/all_presence.csv"
Fetching the data
Now let’s get all the data out of the site. Each county has a dedicated page which contains information about its stats. By looking over the source of the site, we can compile a list of all the counties that we need to inspect. It’s interesting that S1..S6 (Bucharest’s sectors) were modeled as counties.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "B", "SR", "S1", "S2", "S3", "S4", "S5", "S6"]
The vote information is stored on the data/pv/json/ route, specific to each county. In order not to make multiple queries while testing, we first cache all the results locally and refer to them later on.
Above, we’ve deduced the counties we can query, but I’ve found that there are slight asymmetries for certain cases (mostly regarding the expat data and the way Bucharest is represented). Because of this, we need to handle the counties list in a case-by-case fashion.
Code
import json
from tqdm import tqdm_notebook as tqdm
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "S1", "S2", "S3", "S4", "S5", "S6"]
for county in tqdm(counties):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_{county}.json' -H 'accept-encoding: gzip, deflate, br' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' --compressed -o "_data/{county}.json"
The presence data is (as above) stored on a different route (data/presence/json/), specific to each county. Again, we fetch everything and cache it locally.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "B", "SR"]
for county in tqdm(counties[-8:-6]):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/json/presence_{county}_now.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-precincts' -H 'authority: prezenta.bec.ro' --compressed -o "_data/{county}-presence.json"
Let’s also get the all-in-one data about the presence.
Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/csv/presence_now.csv' -H 'Referer: https://prezenta.bec.ro/europarlamentare26052019/abroad-pv-part' --compressed -o "_data/all_presence.csv"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2618k 0 2618k 0 0 5729k 0 --:--:-- --:--:-- --:--:-- 5729k
Compiling the data
Loading a presence file
When reading a presence file, there’s some manipulation that we need to do, because the returned JSON contains lots of information that seemed either useless, redundant (info we already had in other places), or that I didn’t know how to interpret.
There was also the age_ranges field, which actually contained a list of values that I needed to expand into individual columns by using a transform function.
Code
import pandas as pd

def read_json_file(file_name):
    # Small helper for loading a raw JSON file (the original notebook defines this elsewhere).
    with open(file_name) as f:
        return json.load(f)

def process_row(row):
    # Expand the age_ranges dict of a row into a flat tuple of counts.
    return tuple(row.age_ranges.values())

def load_presence(presence_file_name):
    _json = read_json_file(presence_file_name)
    _df = pd.DataFrame.from_records(_json['precinct'])
    _df[["men_18_24", "men_25_34", "men_35_44", "men_45_64", "men_65+", "women_18_24", "women_25_34", "women_35_44", "women_45_64", "women_65+"]] = _df.apply(process_row, axis=1, result_type='expand')
    _df.drop(columns=['age_ranges'], inplace=True)
    _df.columns = [
        'liste_permanente',
        'lista_suplimentare',
        'total',
        'urna_mobila',
        'county_code',
        'county_name',
        'id_county',
        'id_locality',
        'id_precinct',
        'id_uat',
        'initial_count',
        'latitude',
        'locality_name',
        'longitude',
        'medium',
        'precinct_name',
        'precinct_nr',
        'presence',
        'siruta',
        'uat_code',
        'uat_name',
        'men_18_24',
        'men_25_34',
        'men_35_44',
        'men_45_64',
        'men_65+',
        'women_18_24',
        'women_25_34',
        'women_35_44',
        'women_45_64',
        'women_65+',
    ]
    return _df
tulcea = load_presence("_data/TL-presence.json")
tulcea.head()
liste_permanente | lista_suplimentare | total | urna_mobila | county_code | county_name | id_county | id_locality | id_precinct | id_uat | ... | men_18_24 | men_25_34 | men_35_44 | men_45_64 | men_65+ | women_18_24 | women_25_34 | women_35_44 | women_45_64 | women_65+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 409 | 31 | 440 | 0 | TL | TULCEA | 38 | 8884 | 15848 | 2882 | ... | 13 | 23 | 36 | 84 | 29 | 17 | 22 | 55 | 110 | 51 |
1 | 471 | 73 | 544 | 0 | TL | TULCEA | 38 | 8884 | 15849 | 2882 | ... | 11 | 28 | 55 | 90 | 71 | 10 | 40 | 62 | 92 | 85 |
2 | 14 | 20 | 34 | 0 | TL | TULCEA | 38 | 8909 | 15943 | 2892 | ... | 1 | 1 | 5 | 10 | 4 | 0 | 2 | 4 | 5 | 2 |
3 | 134 | 35 | 169 | 0 | TL | TULCEA | 38 | 8914 | 15948 | 2894 | ... | 5 | 6 | 17 | 37 | 21 | 4 | 9 | 7 | 37 | 26 |
4 | 638 | 70 | 708 | 0 | TL | TULCEA | 38 | 8912 | 15946 | 2894 | ... | 19 | 42 | 46 | 185 | 78 | 17 | 42 | 53 | 142 | 84 |
5 rows × 31 columns
Getting all the individual results files for each precinct_nr
The voting results (who voted for what, and where) are stored on the data/pv route. The problem is that if we only use the per-county file downloaded above, there aren’t any keys that link the results to the presence rows. To be more exact, the results are listed per voting facility, whereas the presence is aggregated at the locality level. This means that we can’t really join the two.
Fortunately, I’ve found that if you know the id of a specific voting facility, you can ask for the results of that specific facility through a version of the original results API.
So the final strategy that worked was something along the following lines:
- for each county, load the presence file associated with it
- inspect all the precinct_nr values it contains
- individually query the API for the results of each precinct_nr.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "SR"]
for county in tqdm(counties):
df_county = load_presence(f"_data/{county}-presence.json")
for precinct_nr in tqdm(df_county['precinct_nr'].values, leave=False):
file_name = f"_data/{county}_results_{precinct_nr}.json"
if not os.path.exists(file_name):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
else:
with open(file_name) as f:
file_contents = f.read()
if "Cod birou electoral" not in file_contents:
print(f"File: {file_name} has bad content {file_contents[:50]}. Will retry")
os.remove(file_name)
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
Bucharest is a special case. It’s treated as a county, but the results are stored per sector, so we need to do things a bit differently.
Code
county = "B"
df_county = load_presence(f"_data/{county}-presence.json")
for id_sector in tqdm(df_county.id_locality.unique()):
sector = f"S{int(id_sector) - 9910 + 1}"
print(f"Processing: {sector}")
county = sector
for precinct_nr in tqdm(df_county[df_county.id_locality == id_sector]['precinct_nr'].values, leave=False):
file_name = f"_data/{county}_results_{precinct_nr}.json"
if not os.path.exists(file_name):
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
else:
with open(file_name) as f:
file_contents = f.read()
if "Cod birou electoral" not in file_contents:
print(f"File: {file_name} has bad content {file_contents[:50]}. Will retry")
os.remove(file_name)
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
In SR we have the data about the foreign (abroad) voting offices.
Code
load_presence("_data/SR-presence.json").head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
liste_permanente | 0 | 0 | 0 | 0 | 0 |
lista_suplimentare | 18 | 57 | 865 | 79 | 1330 |
total | 18 | 57 | 865 | 79 | 1330 |
urna_mobila | 0 | 0 | 0 | 0 | 0 |
county_code | SR | SR | SR | SR | SR |
county_name | STRAINATATE | STRAINATATE | STRAINATATE | STRAINATATE | STRAINATATE |
id_county | 43 | 43 | 43 | 43 | 43 |
id_locality | 10244 | 10178 | 10334 | 10206 | 9996 |
id_precinct | 18619 | 18627 | 19096 | 18716 | 18723 |
id_uat | 3230 | 3218 | 3245 | 3219 | 3186 |
initial_count | 0 | 0 | 0 | 0 | 0 |
latitude | None | None | None | None | None |
locality_name | KABUL | BUENOS AIRES | RENNES | TBILISI | STUTTGART |
longitude | None | None | None | None | None |
medium | U | U | U | U | U |
precinct_name | Eticheta Credentiale/Tableta - 1 | Eticheta Credentiale/Tableta - 10 | Eticheta Credentiale/Tableta - 100 | Eticheta Credentiale/Tableta - 101 | Eticheta Credentiale/Tableta - 102 |
precinct_nr | 1 | 10 | 100 | 101 | 102 |
presence | 0 | 0 | 0 | 0 | 0 |
siruta | None | None | None | None | None |
uat_code | AF | AR | FR | GE | DE |
uat_name | AFGANISTAN | ARGENTINA | FRANȚA | GEORGIA | GERMANIA |
men_18_24 | 0 | 1 | 50 | 1 | 71 |
men_25_34 | 4 | 3 | 155 | 9 | 258 |
men_35_44 | 11 | 4 | 172 | 18 | 270 |
men_45_64 | 3 | 7 | 141 | 14 | 249 |
men_65+ | 0 | 1 | 8 | 1 | 4 |
women_18_24 | 0 | 0 | 37 | 1 | 50 |
women_25_34 | 0 | 10 | 119 | 13 | 182 |
women_35_44 | 0 | 13 | 93 | 11 | 128 |
women_45_64 | 0 | 11 | 76 | 9 | 112 |
women_65+ | 0 | 7 | 14 | 2 | 6 |
Reading all the presence data into a single DataFrame
We now have all the presence data cached, and we’ll read it into a single dataframe.
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "SR", "B"]
df_precints = pd.concat((load_presence(f) for f in tqdm(glob("_data/*-presence.json"))), ignore_index=True)
df_precints.shape
(19171, 31)
Code
df_precints.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
liste_permanente | 696 | 140 | 501 | 571 | 680 |
lista_suplimentare | 63 | 10 | 25 | 41 | 55 |
total | 759 | 150 | 526 | 612 | 736 |
urna_mobila | 0 | 0 | 0 | 0 | 1 |
county_code | VS | VS | VS | VS | VS |
county_name | VASLUI | VASLUI | VASLUI | VASLUI | VASLUI |
id_county | 39 | 39 | 39 | 39 | 39 |
id_locality | 9015 | 9015 | 9006 | 9006 | 9006 |
id_precinct | 16128 | 16187 | 16086 | 16087 | 16088 |
id_uat | 2936 | 2936 | 2933 | 2933 | 2933 |
initial_count | 1470 | 1840 | 1354 | 1375 | 1570 |
latitude | 46.6389853639550 | 46.6421141774663 | 46.2240238056566 | 46.2278431009305 | 46.2278431009305 |
locality_name | VASLUI | VASLUI | BÂRLAD | BÂRLAD | BÂRLAD |
longitude | 27.7326775437114 | 27.7289502189002 | 27.6775710052581 | 27.6686353095150 | 27.6686353095150 |
medium | U | U | U | U | U |
precinct_name | CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂN... | ȘCOALA GIMNAZIALĂ ”CONSTANTIN PARFENE” | ŞCOALA GIMNAZIALĂ ”VICTOR IOAN POPA” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” |
precinct_nr | 1 | 10 | 100 | 101 | 102 |
presence | 51.6327 | 8.1522 | 38.8479 | 44.5091 | 46.879 |
siruta | 161954 | 161954 | 161801 | 161801 | 161801 |
uat_code | None | None | None | None | None |
uat_name | MUNICIPIUL VASLUI | MUNICIPIUL VASLUI | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD |
men_18_24 | 13 | 4 | 18 | 15 | 11 |
men_25_34 | 31 | 5 | 25 | 29 | 40 |
men_35_44 | 66 | 17 | 60 | 52 | 60 |
men_45_64 | 110 | 28 | 104 | 113 | 158 |
men_65+ | 115 | 20 | 43 | 72 | 81 |
women_18_24 | 22 | 6 | 17 | 22 | 14 |
women_25_34 | 34 | 8 | 33 | 32 | 46 |
women_35_44 | 78 | 15 | 64 | 55 | 52 |
women_45_64 | 171 | 28 | 117 | 127 | 178 |
women_65+ | 119 | 19 | 45 | 95 | 96 |
The all_presence.csv file contains information about age groups, more granular than the bucketed info found in the county files. We will merge it with the current dataframe.
Code
_all_df = pd.read_csv("_data/all_presence.csv")
_all_df.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Judet | AB | AB | AB | AB | AB |
UAT | MUNICIPIUL ALBA IULIA | MUNICIPIUL ALBA IULIA | MUNICIPIUL SEBEŞ | MUNICIPIUL SEBEŞ | MUNICIPIUL SEBEŞ |
Localitate | ALBA IULIA | ALBA IULIA | SEBEŞ | SEBEŞ | SEBEŞ |
Siruta | 1026 | 1026 | 1883 | 1883 | 1883 |
Nr sectie de votare | 1 | 10 | 100 | 101 | 102 |
Nume sectie de votare | CENTRUL DE ZI PENTRU PERSOANE VÂRSTNICE | COLEGIUL NAŢIONAL „HOREA CLOŞCA ŞI CRIŞAN” | ŞCOALA GIMNAZIALĂ NR. 2 SEBEŞ | COLEGIUL NAŢIONAL ”LUCIAN BLAGA” SEBEŞ | COLEGIUL NAŢIONAL ”LUCIAN BLAGA” SEBEŞ |
Mediu | U | U | U | U | U |
Votanti lista | 1612 | 1443 | 1415 | 1303 | 1362 |
LP | 901 | 648 | 769 | 697 | 765 |
LS | 45 | 143 | 66 | 73 | 27 |
UM | 0 | 0 | 14 | 42 | 0 |
LT | 946 | 791 | 849 | 812 | 792 |
Barbati 18-24 | 26 | 38 | 34 | 24 | 31 |
Barbati 25-34 | 58 | 68 | 56 | 56 | 35 |
Barbati 35-44 | 88 | 69 | 84 | 70 | 82 |
Barbati 45-64 | 165 | 128 | 157 | 136 | 136 |
Barbati 65+ | 102 | 71 | 88 | 78 | 58 |
Femei 18-24 | 32 | 40 | 30 | 30 | 35 |
Femei 25-34 | 72 | 56 | 61 | 61 | 48 |
Femei 35-44 | 107 | 79 | 76 | 92 | 122 |
Femei 45-64 | 178 | 161 | 163 | 167 | 160 |
Femei 65+ | 118 | 81 | 100 | 98 | 85 |
Barbati 18 | 7 | 9 | 5 | 4 | 3 |
Barbati 19 | 4 | 4 | 4 | 6 | 5 |
Barbati 20 | 5 | 7 | 6 | 2 | 9 |
Barbati 21 | 1 | 2 | 2 | 2 | 5 |
Barbati 22 | 2 | 7 | 7 | 1 | 7 |
Barbati 23 | 5 | 4 | 4 | 4 | 0 |
Barbati 24 | 2 | 5 | 6 | 5 | 2 |
Barbati 25 | 5 | 9 | 2 | 2 | 1 |
... | ... | ... | ... | ... | ... |
Femei 91 | 0 | 0 | 0 | 0 | 0 |
Femei 92 | 1 | 0 | 0 | 0 | 0 |
Femei 93 | 0 | 0 | 0 | 0 | 0 |
Femei 94 | 0 | 0 | 0 | 0 | 0 |
Femei 95 | 0 | 0 | 0 | 0 | 0 |
Femei 96 | 0 | 0 | 0 | 0 | 0 |
Femei 97 | 0 | 0 | 0 | 0 | 0 |
Femei 98 | 0 | 0 | 0 | 0 | 0 |
Femei 99 | 0 | 0 | 0 | 0 | 0 |
Femei 100 | 0 | 0 | 0 | 0 | 0 |
Femei 101 | 0 | 0 | 0 | 0 | 0 |
Femei 102 | 0 | 0 | 0 | 0 | 0 |
Femei 103 | 0 | 0 | 0 | 0 | 0 |
Femei 104 | 0 | 0 | 0 | 0 | 0 |
Femei 105 | 0 | 0 | 0 | 0 | 0 |
Femei 106 | 0 | 0 | 0 | 0 | 0 |
Femei 107 | 0 | 0 | 0 | 0 | 0 |
Femei 108 | 0 | 0 | 0 | 0 | 0 |
Femei 109 | 0 | 0 | 0 | 0 | 0 |
Femei 110 | 0 | 0 | 0 | 0 | 0 |
Femei 111 | 0 | 0 | 0 | 0 | 0 |
Femei 112 | 0 | 0 | 0 | 0 | 0 |
Femei 113 | 0 | 0 | 0 | 0 | 0 |
Femei 114 | 0 | 0 | 0 | 0 | 0 |
Femei 115 | 0 | 0 | 0 | 0 | 0 |
Femei 116 | 0 | 0 | 0 | 0 | 0 |
Femei 117 | 0 | 0 | 0 | 0 | 0 |
Femei 118 | 0 | 0 | 0 | 0 | 0 |
Femei 119 | 0 | 0 | 0 | 0 | 0 |
Femei 120 | 0 | 0 | 0 | 0 | 0 |
228 rows × 5 columns
It’s interesting that this file contains presence data with one column per year of age (which is more granular than the 10-year buckets we had prior).
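As a quick sanity check (a sketch, assuming the column names shown above), the per-year columns should add up to the corresponding 10-year buckets:
Code
# The per-year male columns for ages 18..24 should sum to the 18-24 bucket.
per_year = [f"Barbati {age}" for age in range(18, 25)]
(_all_df[per_year].sum(axis=1) == _all_df["Barbati 18-24"]).all()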
Reading all the results data into a single DataFrame
We will load the individual results files, fetched a few steps above, into a single big pandas DataFrame.
Code
df_results = pd.concat((pd.read_csv(f) for f in tqdm(glob("_data/*_results_*"))), ignore_index=True)
df_results.shape
(19171, 36)
Code
df_results.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Cod birou electoral | 22 | 5 | 35 | 35 | 6 |
Județ | HUNEDOARA | BIHOR | SUCEAVA | SUCEAVA | BISTRIŢA-NĂSĂUD |
Uat | ORAŞ CĂLAN | CEICA | VICOVU DE JOS | MUNICIPIUL SUCEAVA | BUDEŞTI |
Localitate | CĂLAN | BUCIUM | VICOVU DE JOS | SUCEAVA | BUDEŞTI-FÂNAŢE |
Secție | ŞCOALA GIMNAZIALĂ | ȘCOALA BUCIUM | SCOALA CU CLASELE I-VIII IOAN VICOVEANU | GRĂDINIŢA CU PROGRAM NORMAL NR.7 | ŞCOALA PRIMARĂ BUDEŞTI-FÎNAŢE |
Nr | 190 | 320 | 532 | 61 | 97 |
Tip | Europarlamentare | Europarlamentare | Europarlamentare | Europarlamentare | Europarlamentare |
a | 1471 | 172 | 1344 | 1393 | 256 |
a1 | 1471 | 172 | 1344 | 1393 | 256 |
a2 | 0 | 0 | 0 | 0 | 0 |
b | 534 | 116 | 520 | 625 | 162 |
b1 | 505 | 88 | 479 | 560 | 141 |
b2 | 0 | 0 | 0 | 0 | 0 |
b3 | 29 | 28 | 41 | 65 | 21 |
c | 1600 | 188 | 1500 | 1500 | 300 |
d | 1066 | 72 | 980 | 875 | 138 |
e | 514 | 113 | 504 | 605 | 153 |
f | 20 | 3 | 16 | 20 | 9 |
h | 0 | 0 | NU ESTE CAZUL | NU ESTE CAZUL | 0 |
i | 0 | 0 | FOARTE BUNA | FOARTE BUNA | 0 |
g1 | 111 | 46 | 174 | 128 | 18 |
g2 | 86 | 14 | 51 | 126 | 29 |
g3 | 29 | 5 | 34 | 43 | 7 |
g4 | 15 | 0 | 0 | 2 | 1 |
g5 | 176 | 28 | 162 | 153 | 62 |
g6 | 18 | 4 | 19 | 27 | 10 |
g7 | 2 | 3 | 2 | 6 | 1 |
g8 | 25 | 5 | 29 | 55 | 11 |
g9 | 6 | 0 | 2 | 3 | 3 |
g10 | 3 | 0 | 3 | 1 | 2 |
g11 | 3 | 0 | 3 | 2 | 2 |
g12 | 1 | 0 | 2 | 3 | 2 |
g13 | 2 | 0 | 0 | 1 | 0 |
g14 | 18 | 0 | 4 | 21 | 3 |
g15 | 9 | 2 | 9 | 12 | 1 |
g16 | 10 | 6 | 10 | 22 | 1 |
Joining the results with the presence data
Some code cleanup is needed first. In order to join the two dataframes, we need to make slight conversions so that the keys from both sides match.
Code
df_results.loc[df_results.Județ == "STRĂINĂTATE", "Județ"] = "STRAINATATE"
df_results.loc[df_results.Uat == "OMAN", "Uat"] = "SULTANATUL OMAN"
# All six Bucharest sectors map onto the same county name.
sectors = [f"SECTOR {i}" for i in range(1, 7)]
df_results.loc[df_results.Județ.isin(sectors), "Județ"] = "MUNICIPIUL BUCUREŞTI"
Now, if we merge the two we will get a single big dataframe with the same number of rows but double the columns.
Code
df_precint_with_results = pd.merge(left=df_precints, right=df_results, left_on=["county_name", "uat_name", "precinct_nr"], right_on=["Județ", "Uat", "Nr"])
df_precint_with_results.shape
(19171, 67)
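A cheap safety net worth keeping around (a sketch): assert that the merge was one-to-one, so no rows were lost or multiplied.
Code
# The merged dataframe should preserve the row count of the presence data.
assert df_precint_with_results.shape[0] == df_precints.shape[0]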
Let’s print an example of how one entry looks in practice.
Code
dict(df_precint_with_results.iloc[0])
{'liste_permanente': 696,
'lista_suplimentare': 63,
'total': 759,
'urna_mobila': 0,
'county_code': 'VS',
'county_name': 'VASLUI',
'id_county': '39',
'id_locality': '9015',
'id_precinct': '16128',
'id_uat': '2936',
'initial_count': 1470,
'latitude': '46.6389853639550',
'locality_name': 'VASLUI',
'longitude': '27.7326775437114',
'medium': 'U',
'precinct_name': 'CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂNASE”',
'precinct_nr': 1,
'presence': 51.6327,
'siruta': '161954',
'uat_code': None,
'uat_name': 'MUNICIPIUL VASLUI',
'men_18_24': 13,
'men_25_34': 31,
'men_35_44': 66,
'men_45_64': 110,
'men_65+': 115,
'women_18_24': 22,
'women_25_34': 34,
'women_35_44': 78,
'women_45_64': 171,
'women_65+': 119,
'Cod birou electoral': 39,
'Județ': 'VASLUI',
'Uat': 'MUNICIPIUL VASLUI',
'Localitate': 'VASLUI',
'Secție': 'CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂNASE”',
'Nr': 1,
'Tip': 'Europarlamentare',
'a': 1470,
'a1': 1470,
'a2': 0,
'b': 759,
'b1': 695,
'b2': 0,
'b3': 64,
'c': 1500,
'd': 741,
'e': 741,
'f': 18,
'h': 0,
'i': 0,
'g1': 185,
'g2': 232,
'g3': 51,
'g4': 0,
'g5': 118,
'g6': 37,
'g7': 2,
'g8': 68,
'g9': 0,
'g10': 4,
'g11': 5,
'g12': 3,
'g13': 2,
'g14': 18,
'g15': 9,
'g16': 7}
We will also join the data with the all_presence.csv file.
Code
df_full = pd.merge(left=df_precint_with_results, right=_all_df, left_on=["county_code", "uat_name", "precinct_nr"], right_on=["Judet", "UAT", "Nr sectie de votare"])
df_full.shape
(19171, 295)
Applying the legend
Some of the columns in the above dataframe are not quite obvious (g1, .., g16, etc.). These are party names that I was only able to find in a legend, inside the dropdown of a button in the UI of the site. I’ve copied it here, along with explanations of some fields that I’ve been able to figure out by looking over the PDFs of the official scanned documents.
We need to convert these column names into more meaningful labels.
Code
df_full.rename(columns={
"g1": "PSD",
"g2": "USR-PLUS",
"g3": "PRO Romania",
"g4": "UDMR",
"g5": "PNL",
"g6": "ALDE",
"g7": "PRODEMO",
"g8": "PMP",
"g9": "Partidul Socialist Roman",
"g10": "Partidul Social Democrat Independent",
"g11": "Partidul Romania Unita",
"g12": "Uniunea Nationala Pentur Progresul Romaniei",
"g13": "Blocul Unitatii Nationale",
"g14": "Gregoriana-Carmen Tudoran",
"g15": "George-Nicaolae Simion",
"g16": "Peter Costea",
"a": "Total alegatori",
"a1": "Total lista permanenta",
"a2": "Total urna mobila",
"b": "Total prezenti",
"b1": "Prezenti lista permanenta",
"b2": "Prezenti urna mobila",
"b3": "Prezenti lista suplimentara",
"c": "Total voturi",
"d": "Voturi nefolosite",
"e": "Voturi valabile",
"f": "Voturi anulate",
"h": "Contestatii",
"i": "Starea sigiliilor"
}, inplace=True)
OK, let’s check the amount of missing data:
Code
na_series = df_full.isna().sum()
na_series[na_series != 0]
latitude 479
longitude 479
siruta 441
uat_code 18730
Contestatii 19
Starea sigiliilor 8
Siruta 441
Removing duplicate columns
Because we’ve basically merged two types of presence datasets (the per-county one and the all_presence.csv one), we ended up with some duplicate columns in the joined dataframe. The columns we joined on are also duplicated, along with columns that contained the same type of information.
We want to eliminate those. We will find the duplicated columns by:
- using the pandas duplicated method (applied on the transposed matrix, since duplicated only works on rows)
- comparing the resulting candidate columns pairwise and keeping the pairs that are exactly equal.
Code
duplicate_columns = df_full.columns[df_full.T.duplicated(keep=False)]
duplicate_columns
Index(['liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila',
'county_code', 'county_name', 'initial_count', 'locality_name',
'medium', 'precinct_name', 'precinct_nr', 'uat_name', 'men_18_24',
'men_25_34', 'men_35_44', 'men_45_64', 'men_65+', 'women_18_24',
'women_25_34', 'women_35_44', 'women_45_64', 'women_65+', 'Județ',
'Uat', 'Nr', 'Judet', 'UAT', 'Localitate_y', 'Nr sectie de votare',
'Nume sectie de votare', 'Mediu', 'Votanti lista', 'LP', 'LS', 'UM',
'LT', 'Barbati 18-24', 'Barbati 25-34', 'Barbati 35-44',
'Barbati 45-64', 'Barbati 65+', 'Femei 18-24', 'Femei 25-34',
'Femei 35-44', 'Femei 45-64', 'Femei 65+', 'Barbati 104', 'Barbati 106',
'Barbati 108', 'Barbati 109', 'Barbati 110', 'Barbati 112',
'Barbati 113', 'Barbati 114', 'Barbati 115', 'Barbati 116',
'Barbati 117', 'Barbati 118', 'Barbati 119', 'Barbati 120', 'Femei 105',
'Femei 106', 'Femei 107', 'Femei 108', 'Femei 110', 'Femei 111',
'Femei 112', 'Femei 113', 'Femei 114', 'Femei 115', 'Femei 116',
'Femei 117', 'Femei 118', 'Femei 119', 'Femei 120'],
dtype='object')
With these, we will compare each column with every other and see which series are equal. This results in a long list of pairs of columns that are duplicates of one another.
Code
_pairs = set()
for i, _a in enumerate(duplicate_columns):
    for _b in duplicate_columns[i+1:]:
        if (df_full[_a] == df_full[_b]).all():
            _pairs.add(tuple(sorted([_a, _b])))
_pairs
{('Barbati 104', 'Barbati 106'),
('Barbati 104', 'Barbati 108'),
('Barbati 104', 'Barbati 109'),
...
('Barbati 120', 'Femei 120'),
('Barbati 18-24', 'men_18_24'),
('Barbati 25-34', 'men_25_34'),
('Barbati 35-44', 'men_35_44'),
('Barbati 45-64', 'men_45_64'),
('Barbati 65+', 'men_65+'),
('Femei 105', 'Femei 106'),
('Femei 105', 'Femei 107'),
...
('Femei 119', 'Femei 120'),
('Femei 18-24', 'women_18_24'),
('Femei 25-34', 'women_25_34'),
('Femei 35-44', 'women_35_44'),
('Femei 45-64', 'women_45_64'),
('Femei 65+', 'women_65+'),
('Judet', 'county_code'),
('Județ', 'county_name'),
('LP', 'liste_permanente'),
('LS', 'lista_suplimentare'),
('LT', 'total'),
('Localitate_y', 'locality_name'),
('Mediu', 'medium'),
('Nr', 'Nr sectie de votare'),
('Nr', 'precinct_nr'),
('Nr sectie de votare', 'precinct_nr'),
('Nume sectie de votare', 'precinct_name'),
('UAT', 'Uat'),
('UAT', 'uat_name'),
('UM', 'urna_mobila'),
('Uat', 'uat_name'),
('Votanti lista', 'initial_count')}
There’s only one more step that we need to do: find the groups of columns that hold the same information. There are cases where the columns are not only duplicated but triplicated, which results in (A == B), (B == C), (C == A) pairs in the analysis above.
This is the perfect job for the disjoint-set data structure.
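The DisjointSets class used below is not a standard library one; here’s a minimal union-find sketch that supports the union and sets calls we need:
Code
from collections import defaultdict

class DisjointSets:
    # Minimal disjoint-set (union-find), just enough for the usage below.
    def __init__(self):
        self._parent = {}

    def _find(self, x):
        # Find the root of x, compressing the path along the way.
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def union(self, a, b):
        # Merge the sets containing a and b.
        self._parent[self._find(a)] = self._find(b)

    def sets(self):
        # Return the disjoint sets as a list of lists.
        groups = defaultdict(list)
        for x in self._parent:
            groups[self._find(x)].append(x)
        return list(groups.values())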
Code
_groups = DisjointSets()
for _a, _b in _pairs:
    _groups.union(_a, _b)
_groups.sets()
[['Barbati 109',
'Femei 118',
'Femei 110',
'Femei 114',
'Barbati 104',
'Femei 115',
'Barbati 116',
'Barbati 117',
'Barbati 114',
'Femei 107',
'Femei 119',
'Femei 105',
'Barbati 115',
'Femei 108',
'Barbati 118',
'Barbati 108',
'Barbati 119',
'Femei 120',
'Femei 116',
'Barbati 112',
'Femei 113',
'Barbati 113',
'Barbati 120',
'Femei 117',
'Barbati 106',
'Barbati 110',
'Femei 106',
'Femei 112',
'Femei 111'],
['LP', 'liste_permanente'],
['Femei 35-44', 'women_35_44'],
['LT', 'total'],
['Nr sectie de votare', 'precinct_nr', 'Nr'],
['UM', 'urna_mobila'],
['Mediu', 'medium'],
['Barbati 65+', 'men_65+'],
['Barbati 35-44', 'men_35_44'],
['Femei 18-24', 'women_18_24'],
['Votanti lista', 'initial_count'],
['Femei 25-34', 'women_25_34'],
['Barbati 25-34', 'men_25_34'],
['UAT', 'Uat', 'uat_name'],
['Barbati 18-24', 'men_18_24'],
['Barbati 45-64', 'men_45_64'],
['Localitate_y', 'locality_name'],
['Femei 45-64', 'women_45_64'],
['Nume sectie de votare', 'precinct_name'],
['Judet', 'county_code'],
['Femei 65+', 'women_65+'],
['Județ', 'county_name'],
['LS', 'lista_suplimentare']]
From the list above, we choose to drop the following columns:
Code
drop_columns = ['Barbati 104', 'Barbati 106',
'Barbati 108', 'Barbati 109', 'Barbati 110', 'Barbati 112',
'Barbati 113', 'Barbati 114', 'Barbati 115', 'Barbati 116',
'Barbati 117', 'Barbati 118', 'Barbati 119', 'Barbati 120', 'Femei 105',
'Femei 106', 'Femei 107', 'Femei 108', 'Femei 110', 'Femei 111',
'Femei 112', 'Femei 113', 'Femei 114', 'Femei 115', 'Femei 116',
'Femei 117', 'Femei 118', 'Femei 119', 'Femei 120', 'LP', 'Femei 35-44', 'LT', 'Nr sectie de votare', 'Nr', 'UM',
'Mediu', 'Barbati 65+', 'Barbati 35-44','Femei 18-24', 'initial_count', 'Femei 25-34', 'Barbati 25-34',
'UAT', 'Uat', 'Barbati 18-24', 'Barbati 45-64', 'Localitate_y', 'Femei 45-64', 'Nume sectie de votare',
'Judet', 'Femei 65+', 'Județ', 'LS',
]
df_final = df_full.drop(columns=drop_columns)
df_final.columns
Index(['liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila',
'county_code', 'county_name', 'id_county', 'id_locality', 'id_precinct',
'id_uat',
...
'Femei 96', 'Femei 97', 'Femei 98', 'Femei 99', 'Femei 100',
'Femei 101', 'Femei 102', 'Femei 103', 'Femei 104', 'Femei 109'],
dtype='object', length=242)
And we end up with the following:
Code
df_final.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
liste_permanente | 696 | 140 | 501 | 571 | 680 |
lista_suplimentare | 63 | 10 | 25 | 41 | 55 |
total | 759 | 150 | 526 | 612 | 736 |
urna_mobila | 0 | 0 | 0 | 0 | 1 |
county_code | VS | VS | VS | VS | VS |
county_name | VASLUI | VASLUI | VASLUI | VASLUI | VASLUI |
id_county | 39 | 39 | 39 | 39 | 39 |
id_locality | 9015 | 9015 | 9006 | 9006 | 9006 |
id_precinct | 16128 | 16187 | 16086 | 16087 | 16088 |
id_uat | 2936 | 2936 | 2933 | 2933 | 2933 |
latitude | 46.6389853639550 | 46.6421141774663 | 46.2240238056566 | 46.2278431009305 | 46.2278431009305 |
locality_name | VASLUI | VASLUI | BÂRLAD | BÂRLAD | BÂRLAD |
longitude | 27.7326775437114 | 27.7289502189002 | 27.6775710052581 | 27.6686353095150 | 27.6686353095150 |
medium | U | U | U | U | U |
precinct_name | CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂN... | ȘCOALA GIMNAZIALĂ ”CONSTANTIN PARFENE” | ŞCOALA GIMNAZIALĂ ”VICTOR IOAN POPA” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” | CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” |
precinct_nr | 1 | 10 | 100 | 101 | 102 |
presence | 51.6327 | 8.1522 | 38.8479 | 44.5091 | 46.879 |
siruta | 161954 | 161954 | 161801 | 161801 | 161801 |
uat_code | None | None | None | None | None |
uat_name | MUNICIPIUL VASLUI | MUNICIPIUL VASLUI | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD | MUNICIPIUL BÂRLAD |
men_18_24 | 13 | 4 | 18 | 15 | 11 |
men_25_34 | 31 | 5 | 25 | 29 | 40 |
men_35_44 | 66 | 17 | 60 | 52 | 60 |
men_45_64 | 110 | 28 | 104 | 113 | 158 |
men_65+ | 115 | 20 | 43 | 72 | 81 |
women_18_24 | 22 | 6 | 17 | 22 | 14 |
women_25_34 | 34 | 8 | 33 | 32 | 46 |
women_35_44 | 78 | 15 | 64 | 55 | 52 |
women_45_64 | 171 | 28 | 117 | 127 | 178 |
women_65+ | 119 | 19 | 45 | 95 | 96 |
... | ... | ... | ... | ... | ... |
Femei 76 | 5 | 2 | 0 | 7 | 2 |
Femei 77 | 4 | 0 | 2 | 0 | 0 |
Femei 78 | 7 | 0 | 2 | 2 | 3 |
Femei 79 | 4 | 1 | 1 | 2 | 3 |
Femei 80 | 5 | 0 | 3 | 1 | 1 |
Femei 81 | 4 | 0 | 0 | 3 | 0 |
Femei 82 | 2 | 0 | 0 | 4 | 3 |
Femei 83 | 1 | 0 | 1 | 2 | 1 |
Femei 84 | 0 | 1 | 0 | 3 | 1 |
Femei 85 | 1 | 0 | 0 | 1 | 0 |
Femei 86 | 2 | 1 | 0 | 1 | 0 |
Femei 87 | 1 | 1 | 0 | 0 | 2 |
Femei 88 | 2 | 1 | 0 | 0 | 0 |
Femei 89 | 0 | 0 | 0 | 0 | 1 |
Femei 90 | 0 | 0 | 0 | 0 | 0 |
Femei 91 | 0 | 0 | 0 | 0 | 0 |
Femei 92 | 0 | 0 | 0 | 0 | 0 |
Femei 93 | 0 | 0 | 0 | 0 | 0 |
Femei 94 | 0 | 0 | 0 | 0 | 0 |
Femei 95 | 0 | 0 | 0 | 0 | 0 |
Femei 96 | 0 | 0 | 0 | 0 | 0 |
Femei 97 | 0 | 0 | 0 | 0 | 0 |
Femei 98 | 0 | 0 | 0 | 0 | 0 |
Femei 99 | 0 | 0 | 0 | 0 | 0 |
Femei 100 | 0 | 0 | 0 | 0 | 0 |
Femei 101 | 0 | 0 | 0 | 0 | 0 |
Femei 102 | 0 | 0 | 0 | 0 | 0 |
Femei 103 | 0 | 0 | 0 | 0 | 0 |
Femei 104 | 0 | 0 | 0 | 0 | 0 |
Femei 109 | 0 | 0 | 0 | 0 | 0 |
242 rows × 5 columns
Saving the data to CSV format
We’re almost done. We only need to save the dataset to disk and start using it (to be continued in a future post)!
Code
df_final.to_csv("_data/final.csv")
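Later on, the dataset can be read back with (a sketch, assuming the same path):
Code
# Read the saved dataset back; index_col=0 restores the saved index.
df_final = pd.read_csv("_data/final.csv", index_col=0)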