Geospatial Data Processing with Python: A Network Data Case Study
Posted on September 1, 2024 • 3 min read • 621 words
Learn how to handle missing values and mismatched data in network datasets.
This blog post walks through a Python exercise that mimics a real-world scenario: transforming messy network data into a format suitable for a business system.
The provided data contains:
Our objective is to generate two GeoJSON files with the following specifications:
spans.geojson:
poles.geojson:
We’ll use Python geospatial libraries to accomplish the data wrangling. Here’s a breakdown of the key steps:
We’ll use open-source libraries that streamline the process. Some potential options include:
GeoPandas: for geospatial data manipulation.
Shapely: for geometric object construction.
The final product includes a Python script that creates two GeoJSON files with aligned lines and data columns.
Necessary scripts or infrastructure to execute the code (e.g., Makefile).
The two generated GeoJSON files: spans.geojson and poles.geojson.
Clear instructions or scripts for easy execution (ideally within five minutes).
Import Libraries
import geopandas as gpd
from shapely.geometry import LineString, Point
import os
import pandas as pd
Find the current working directory, create the output directory and read the input layers
#get current working directory
cwd = os.getcwd()
#create output directory if not exist
if not os.path.exists(f"{cwd}/output/"):
    os.makedirs(f"{cwd}/output/")
#read poles and lines layers
poles = gpd.read_file(f'{cwd}/data/input_data.gpkg', layer= 'poles')
lines = gpd.read_file(f'{cwd}/data/input_data.gpkg', layer= 'lines')
Iterate through each line and snap its endpoints to the nearest pole, adding Point A and Point B columns to the lines layer
###Stage 1 snap spans to nearest poles and add Point A Point B columns
print('Snapping lines')
for index, line in lines.iterrows():
    line_coords = list(line.geometry.coords)
    #find nearest pole for endpoint A
    end_point_1 = Point(line.geometry.coords[0]) #line endpoint A
    nearest_pole_id_1 = poles.distance(end_point_1).idxmin() #nearest pole id to point A
    nearest_pole_1 = poles.loc[nearest_pole_id_1] #nearest pole
    lines.at[index, 'Point A'] = nearest_pole_1['poleID']
    line_coords[0] = nearest_pole_1.geometry.coords[0]
    #find nearest pole for endpoint B
    end_point_2 = Point(line.geometry.coords[-1]) #line endpoint B
    nearest_pole_id_2 = poles.distance(end_point_2).idxmin() #nearest pole id to point B
    nearest_pole_2 = poles.loc[nearest_pole_id_2] #nearest pole
    lines.at[index, 'Point B'] = nearest_pole_2['poleID']
    line_coords[-1] = nearest_pole_2.geometry.coords[0]
    #update line geometry with snapped coordinates
    lines.at[index, "geometry"] = LineString(line_coords)
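To see the snapping logic from the loop above in isolation, here is a minimal, dependency-free sketch on plain coordinate tuples. The function name `snap_line` and the toy poles and line are hypothetical and not part of the original script; it assumes planar coordinates and straight-line distance, which is what `poles.distance(...)` computes in a projected CRS.

```python
from math import dist

def snap_line(line_coords, poles):
    """Snap the first and last vertex of one line to the nearest pole.

    line_coords: list of (x, y) tuples with at least two vertices.
    poles: dict mapping pole id -> (x, y).
    Returns (snapped_coords, pole_id_a, pole_id_b).
    """
    def nearest(pt):
        # id of the pole with the smallest Euclidean distance to pt
        return min(poles, key=lambda pid: dist(poles[pid], pt))

    pole_a = nearest(line_coords[0])
    pole_b = nearest(line_coords[-1])
    snapped = list(line_coords)      # copy so the input is left untouched
    snapped[0] = poles[pole_a]       # move endpoint A onto its pole
    snapped[-1] = poles[pole_b]      # move endpoint B onto its pole
    return snapped, pole_a, pole_b

# Toy data (hypothetical): two poles and a line that misses both slightly
poles = {"P1": (0.0, 0.0), "P2": (10.0, 0.0)}
line = [(0.3, -0.2), (5.0, 0.5), (9.8, 0.1)]
snapped, a, b = snap_line(line, poles)
print(snapped, a, b)
```

Intermediate vertices are left alone, as in the script. One case worth flagging in real data is a line whose two endpoints snap to the same pole, since that usually points to a very short or misplaced span.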
Identify missing values in the poles layer and populate them
###Stage 2 populate missing values from the nearest pole (only applicable for poles height)
#lines[lines.isna().any(axis=1)]
poles_nan = poles[poles['height'].isna()] #find nan rows
for index, pole in poles_nan.iterrows():
    distances = poles.distance(pole.geometry) #calculate distances to all other poles
    sorted_distances = distances[distances > 0].sort_values() #sort distances and filter out the pole itself
    #check the nearest non-NaN pole
    for nearest_pole_id in sorted_distances.index:
        nearest_pole = poles.loc[nearest_pole_id]
        if pd.notna(nearest_pole['height']):
            poles.at[index, "height"] = nearest_pole['height'] #assign nearest pole height
            break #stop once we've assigned the nearest valid height
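The fill step can be sketched without GeoPandas as well, which makes it easy to check on a handful of points. This is an illustration under stated assumptions rather than the script's exact behaviour: `fill_missing_heights` and the toy poles are hypothetical, missing heights are represented as `None`, and only poles whose height was known from the start act as donors.

```python
from math import dist

def fill_missing_heights(poles):
    """Fill missing pole heights from the nearest pole with a known height.

    poles: dict id -> {"xy": (x, y), "height": float or None}.
    Returns a new dict id -> height; the input is not modified.
    """
    # Poles that can donate a height (known in the original data)
    known = [p for p in poles.values() if p["height"] is not None]
    filled = {}
    for pid, pole in poles.items():
        if pole["height"] is not None or not known:
            # Keep existing values; leave None if there is no donor at all
            filled[pid] = pole["height"]
            continue
        donor = min(known, key=lambda k: dist(k["xy"], pole["xy"]))
        filled[pid] = donor["height"]
    return filled

# Toy data (hypothetical): pole B has no height and sits closest to pole A
poles = {
    "A": {"xy": (0.0, 0.0), "height": 9.0},
    "B": {"xy": (1.0, 0.0), "height": None},
    "C": {"xy": (5.0, 0.0), "height": 12.0},
}
filled = fill_missing_heights(poles)
print(filled)
```

The loop in the script reads from the same `poles` frame it updates, so a height filled earlier can be reused for a later pole. The sketch draws only on originally known heights, which keeps the result independent of row order.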
Save output GeoJSON files
#save output files
print("Writing output")
try:
    lines.to_file(f"{cwd}/output/spans.geojson")
    poles.to_file(f"{cwd}/output/poles.geojson")
except Exception as e:
    print(f"Error saving files: {e}")
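Once the files are written, a quick consistency check can confirm that every span really starts and ends on the pole it names. The sketch below uses only the standard library and works on parsed GeoJSON dictionaries. The helpers `load_geojson` and `check_spans` and the tolerance value are assumptions for illustration; the property names `poleID`, `Point A` and `Point B` are taken from the script, and spans are assumed to be LineStrings and poles Points.

```python
import json

def load_geojson(path):
    """Read a GeoJSON file into a plain dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def check_spans(spans_fc, poles_fc, tol=1e-9):
    """Return a list of problems; an empty list means every span is consistent."""
    # Map pole id -> (x, y) from the poles FeatureCollection
    pole_xy = {
        f["properties"]["poleID"]: tuple(f["geometry"]["coordinates"][:2])
        for f in poles_fc["features"]
    }
    problems = []
    for i, feat in enumerate(spans_fc["features"]):
        coords = feat["geometry"]["coordinates"]
        # Compare the first vertex with Point A and the last with Point B
        for key, vertex in (("Point A", coords[0]), ("Point B", coords[-1])):
            pid = feat["properties"].get(key)
            if pid not in pole_xy:
                problems.append(f"span {i}: {key} refers to unknown pole {pid!r}")
                continue
            px, py = pole_xy[pid]
            if abs(vertex[0] - px) > tol or abs(vertex[1] - py) > tol:
                problems.append(f"span {i}: {key} vertex is not on pole {pid!r}")
    return problems

def _pole(pid, x, y):
    return {"type": "Feature", "properties": {"poleID": pid},
            "geometry": {"type": "Point", "coordinates": [x, y]}}

def _span(a, b, coords):
    return {"type": "Feature", "properties": {"Point A": a, "Point B": b},
            "geometry": {"type": "LineString", "coordinates": coords}}

# Toy data (hypothetical): one consistent span and one with two defects
poles_fc = {"type": "FeatureCollection",
            "features": [_pole("P1", 0.0, 0.0), _pole("P2", 10.0, 0.0)]}
spans_fc = {"type": "FeatureCollection",
            "features": [_span("P1", "P2", [[0.0, 0.0], [10.0, 0.0]]),
                         _span("P1", "P9", [[0.5, 0.0], [10.0, 0.0]])]}
problems = check_spans(spans_fc, poles_fc)
print(problems)
```

Against the real output, the same check would be called as `check_spans(load_geojson("output/spans.geojson"), load_geojson("output/poles.geojson"))`, with the tolerance loosened if coordinates are rounded on export.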
With this exercise, learners gain practical experience in data cleaning, geospatial analysis and data preparation for specific systems. This skill set is useful in many data-driven domains: it demonstrates how to wrangle data from a raw format into a business-compatible one.