duckdb导入shp文件数据
duckdb操作地理数据
背景
网购边界数据为shp格式文件,咨询gpt可以使用python的geopandas库进行解析:
|
|
耗时1m12s,输出:
|
|
入库
我的目的最终还是入库,这样方便利用数据库内置的函数进行查询。
duckdb通过SPATIAL扩展(官方使用手册)可以非常非常方便的入库,且效率是geopandas的14倍(R7-6800H,16GB,SSD上的结果):
|
|
扩展知识
duckdb除了能导入shp文件外,通过ST_Read函数还能直接导入 50 多种不同格式的地理空间文件。
前面的select from shp实际是ST_Read函数的语法糖,也就是说其底层调用了ST_Read。
查询
试图通过查询来验证数据的真实性:
|
|
未命中结果….,以为是函数不对,又使用另一个判断点是否在多边形内的函数select * from geotable where ST_Within(ST_Point(152.60,28.30),geom);,还是没有命中的记录,
一瞬间我还以为是假数据。select geom列,结果是POLYGON ((12584777.005303478 3274082.042692696, 12584782.035163349 3274022.8823635867, 12584783.106613439 3274010.28238…,明显不是152.22,28.30这中坐标对,我估计是坐标系不对。
经过查看与shp文件同目录的prj文件及GPT介绍,发现shp中储存的坐标参考坐标系是WGS 1984 Web Mercator Auxiliary Sphere投影,我提供的点是WGS 84坐标系
测试坐标系转换
|
|
可以看到正常坐标对了。
转换并更新到源列
|
|
再进行查询就正常了。
合并图形
合并社区得到街道的边界数据:以街道名称分组,然后将分组数据合并。
|
|
合并指定的街道
|
|
总结
duckdb官方手册中罗列了ST_Transform函数,但是未进行功能说明,我丢给GPT后,GPT告诉我可以使用来进行坐标系转换,包括ST_Contains和ST_Within、ST_Point都是自己凭经验和字面意思来理解其作用的,我估计这是地理空间处理领域的通用函数名。但是其中ST_Transform的参数标识符EPSG:3857和EPSG:4326,我不知道是哪里来的。
仔细阅读官方手册,其中有一段很重要的描述
The spatial extension implements a large number of scalar functions and overloads. Most of these are implemented using the GEOS library, but we’d like to implement more of them natively in this extension to better utilize DuckDB’s vectorized execution and memory management. The following symbols are used to indicate which implementation is used….
🧭 - GEOS - functions that are implemented using the GEOS library
🦆 - DuckDB - functions that are implemented natively in this extension that are capable of operating directly on the DuckDB types
🔄 - CAST(GEOMETRY) - functions that are supported by implicitly casting to GEOMETRY and then using the GEOMETRY implementation
| Scalar functions | GEOMETRY | POINT_2D | LINESTRING_2D | POLYGON_2D | BOX_2D |
|---|---|---|---|---|---|
| GEOMETRY ST_Point(DOUBLE, DOUBLE) | 🦆 | 🦆 | |||
| GEOMETRY ST_ConvexHull(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_Boundary(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_Buffer(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_Centroid(GEOMETRY) | 🧭 | 🦆 | 🦆 | 🦆 | 🦆 |
| GEOMETRY ST_Collect(GEOMETRY[]) | 🦆 | 🦆 | 🦆 | 🦆 | 🦆 |
| GEOMETRY ST_Normalize(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_SimplifyPreserveTopology(GEOMETRY, DOUBLE) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_Simplify(GEOMETRY, DOUBLE) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_Union(GEOMETRY, GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_Intersection(GEOMETRY, GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_MakeLine(GEOMETRY[]) | 🦆 | 🦆 | |||
| GEOMETRY ST_Envelope(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_FlipCoordinates(GEOMETRY) | 🦆 | 🦆 | 🦆 | 🦆 | 🦆 |
GEOMETRY ST_Transform(GEOMETRY, VARCHAR, VARCHAR) |
🦆 | 🦆 | 🦆 | 🦆 | 🦆 |
| BOX_2D ST_Extent(GEOMETRY) | 🦆 | 🦆 | 🦆 | 🦆 | 🦆 |
| GEOMETRY ST_PointN(GEOMETRY, INTEGER) | 🦆 | 🦆 | |||
| GEOMETRY ST_StartPoint(GEOMETRY) | 🦆 | 🦆 | |||
| GEOMETRY ST_EndPoint(GEOMETRY) | 🦆 | 🦆 | |||
| GEOMETRY ST_ExteriorRing(GEOMETRY) | 🦆 | 🦆 | |||
| GEOMETRY ST_Reverse(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 |
| GEOMETRY ST_RemoveRepeatedPoints(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON ) |
| GEOMETRY ST_RemoveRepeatedPoints(GEOMETRY, DOUBLE) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON ) |
| GEOMETRY ST_ReducePrecision(GEOMETRY, DOUBLE) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON ) |
| GEOMETRY ST_PointOnSurface(GEOMETRY) | 🧭 | 🔄 | 🔄 | 🔄 | 🔄 (as POLYGON) |
| GEOMETRY ST_CollectionExtract(GEOMETRY) | 🦆 | ||||
| GEOMETRY ST_CollectionExtract(GEOMETRY, INTEGER) | 🦆 |
ST_Transform函数都是🦆,说明是duckdb自创的,此处不得不为duckdb官方的辛劳付出点赞,而ST_Transform的用法只能看官方仓库示例和博客示例得知。
目前duckdb操作地理空间的函数缺乏文档来说明使用方法,已经有人提出issue,官方已经说会出,但目前仍未出。