duckdb导入shp文件数据

duckdb操作地理数据

  1. GEOJSON文件导入各大数据库的方式
  2. GPKG文件导入数据库

网购边界数据为shp格式文件,咨询gpt可以使用python的geopandas库进行解析:

1
2
3
4
5
6
7
8
import geopandas as gpd

# 读取Shapefile文件
shp_file = "村庄面1.shp"
gdf = gpd.read_file(shp_file)

# 查看GeoDataFrame的前几行
gdf.head()

耗时1m12s,输出:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21

   FID              CJDCQDM     CJDCQMC        DCMJ        JSMJ MSSM  HDMC  \
0  0.0  1201020010010000000         XX村   971721.66   971721.66   00  None   
1  1.0  1201020010020000000         XX村  1955768.00  1955767.99   00  None   
2  2.0  1201020010030000000         XX村  1278073.65  1278073.65   00  None   
3  3.0  1201020010040000000         XX村  1381409.38  1381409.37   00  None   
4  4.0  1201020029990000000         XX村  1936147.52  1936147.51   00  None   

     BZ    XZXZQDM XZXZQMC XJXZQDM XJXZQMC SJXZQDM SJXZQMC  \
0  None  120102001    XX街道  120102     XX区    1201     XX市   
1  None  120102001    XX街道  120102     XX区    1201     XX市   
2  None  120102001    XX街道  120102     XX区    1201     XX市   
3  None  120102001    XX街道  120102     XX区    1201     XX市   
4  None  120102002    XX街道  120102     XX区    1201     XX市   

                                            geometry  
0  POLYGON ((12584777.005 3274082.043, 12584782.0...  
1  POLYGON ((12584405.990 3274961.921, 12584406.1...  
2  POLYGON ((12586913.723 3274871.901, 12586913.8...  
3  POLYGON ((12585892.150 3273180.798, 12585892.2...  
4  POLYGON ((12583490.814 3274392.966, 12583522.3...  

我的目的最终还是入库,这样方便利用数据库内置的函数进行查询。 duckdb通过SPATIAL扩展(官方使用手册)可以非常非常方便的入库,且效率是geopandas的14倍(R7-6800H,16GB,SSD上的结果):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
E:\pgis>duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open database.db
D INSTALL spatial;
D LOAD spatial;
D .timer on
D create table geotable as select * from '村庄面1.shp' limit 10;
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 4.983 user 0.234375 sys 0.812500

duckdb除了能导入shp文件外,通过ST_Read函数还能直接导入 50 多种不同格式的地理空间文件。 前面的select from shp实际是ST_Read函数的语法糖,也就是说其底层调用了ST_Read

试图通过查询来验证数据的真实性:

1
2
3
4
5
6
7
8
D select * from geotable where  ST_Contains(geom,ST_Point(152.60,28.30));
100% ▕████████████████████████████████████████████████████████████▏
┌────────┬─────────┬─────────┬────────┬────────┬───┬─────────┬─────────┬─────────┬─────────┬─────────┬──────────┐
  FID    CJDCQDM  CJDCQMC   DCMJ    JSMJ     XZXZQMC  XJXZQDM  XJXZQMC  SJXZQDM  SJXZQMC    geom   
 double  varchar  varchar  double  double     varchar  varchar  varchar  varchar  varchar  geometry 
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
                                                    0 rows                                                     
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

未命中结果….,以为是函数不对,又使用另一个判断点是否在多边形内的函数select * from geotable where ST_Within(ST_Point(152.60,28.30),geom);,还是没有命中的记录, 一瞬间我还以为是假数据。select geom列,结果是POLYGON ((12584777.005303478 3274082.042692696, 12584782.035163349 3274022.8823635867, 12584783.106613439 3274010.28238…,明显不是152.22,28.30这中坐标对,我估计是坐标系不对。 经过查看与shp文件同目录的prj文件及GPT介绍,发现shp中储存的坐标参考坐标系是WGS 1984 Web Mercator Auxiliary Sphere投影,我提供的点是WGS 84坐标系

prj文件内容
PROJCS[“WGS_1984_Web_Mercator_Auxiliary_Sphere”, GEOGCS[“GCS_WGS_1984”,DATUM[“D_WGS_1984”, SPHEROID[“WGS_1984”,6378137.0,298.257223563]], PRIMEM[“Greenwich”,0.0], UNIT[“Degree”,0.0174532925199433]], PROJECTION[“Mercator_Auxiliary_Sphere”], PARAMETER[“False_Easting”,0.0], PARAMETER[“False_Northing”,0.0], PARAMETER[“Central_Meridian”,0.0], PARAMETER[“Standard_Parallel_1”,0.0], PARAMETER[“Auxiliary_Sphere_Type”,0.0], UNIT[“Meter”,1.0]]
于是查询duckdb使用手册,其有一个函数可进行坐标系转换。

1
2
select ST_Transform(geom, 'EPSG:4326', 'EPSG:3857') from geotable limit 1;
-- POLYGON ((28.19896564100002 113.05097531100013, 28.19849727000008 113.05102049500012, 28.198397516000057 113.0510301200…

可以看到正常坐标对了。

1
UPDATE geotable SET geom = ST_Transform(geom, 'EPSG:3857', 'EPSG:4326');

再进行查询就正常了。

合并社区得到街道的边界数据:以街道名称分组,然后将分组数据合并。

1
SELECT  XZXZQMC, ST_Union_Agg(geom) FROM geotable GROUP BY XZXZQMC;
1
SELECT  XZXZQMC, ST_Union_Agg(geom) FROM geotable WHERE XZXZQMC = 'XXX' GROUP BY XZXZQMC;

duckdb官方手册中罗列了ST_Transform函数,但是未进行功能说明,我丢给GPT后,GPT告诉我可以使用来进行坐标系转换,包括ST_ContainsST_WithinST_Point都是自己凭经验和字面意思来理解其作用的,我估计这是地理空间处理领域的通用函数名。但是其中ST_Transform的参数标识符EPSG:3857EPSG:4326,我不知道是哪里来的。 仔细阅读官方手册,其中有一段很重要的描述

重要描述

The spatial extension implements a large number of scalar functions and overloads. Most of these are implemented using the GEOS library, but we’d like to implement more of them natively in this extension to better utilize DuckDB’s vectorized execution and memory management. The following symbols are used to indicate which implementation is used….

🧭 - GEOS - functions that are implemented using the GEOS library

🦆 - DuckDB - functions that are implemented natively in this extension that are capable of operating directly on the DuckDB types

🔄 - CAST(GEOMETRY) - functions that are supported by implicitly casting to GEOMETRY and then using the GEOMETRY implementation

Scalar functions GEOMETRY POINT_2D LINESTRING_2D POLYGON_2D BOX_2D
GEOMETRY ST_Point(DOUBLE, DOUBLE) 🦆 🦆      
GEOMETRY ST_ConvexHull(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_Boundary(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_Buffer(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_Centroid(GEOMETRY) 🧭 🦆 🦆 🦆 🦆
GEOMETRY ST_Collect(GEOMETRY[]) 🦆 🦆 🦆 🦆 🦆
GEOMETRY ST_Normalize(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_SimplifyPreserveTopology(GEOMETRY, DOUBLE) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_Simplify(GEOMETRY, DOUBLE) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_Union(GEOMETRY, GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_Intersection(GEOMETRY, GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_MakeLine(GEOMETRY[]) 🦆   🦆    
GEOMETRY ST_Envelope(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_FlipCoordinates(GEOMETRY) 🦆 🦆 🦆 🦆 🦆
GEOMETRY ST_Transform(GEOMETRY, VARCHAR, VARCHAR) 🦆 🦆 🦆 🦆 🦆
BOX_2D ST_Extent(GEOMETRY) 🦆 🦆 🦆 🦆 🦆
GEOMETRY ST_PointN(GEOMETRY, INTEGER) 🦆   🦆    
GEOMETRY ST_StartPoint(GEOMETRY) 🦆   🦆    
GEOMETRY ST_EndPoint(GEOMETRY) 🦆   🦆    
GEOMETRY ST_ExteriorRing(GEOMETRY) 🦆     🦆  
GEOMETRY ST_Reverse(GEOMETRY) 🧭 🔄 🔄 🔄 🔄
GEOMETRY ST_RemoveRepeatedPoints(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON )
GEOMETRY ST_RemoveRepeatedPoints(GEOMETRY, DOUBLE) 🧭 🔄 🔄 🔄 🔄 (as POLYGON )
GEOMETRY ST_ReducePrecision(GEOMETRY, DOUBLE) 🧭 🔄 🔄 🔄 🔄 (as POLYGON )
GEOMETRY ST_PointOnSurface(GEOMETRY) 🧭 🔄 🔄 🔄 🔄 (as POLYGON)
GEOMETRY ST_CollectionExtract(GEOMETRY) 🦆        
GEOMETRY ST_CollectionExtract(GEOMETRY, INTEGER) 🦆        
可以看到ST_Transform函数都是🦆,说明是duckdb自创的,此处不得不为duckdb官方的辛劳付出点赞,而ST_Transform的用法只能看官方仓库示例博客示例得知。

目前duckdb操作地理空间的函数缺乏文档来说明使用方法,已经有人提出issue,官方已经说会出,但目前仍未出。