Filling in latitude and longitude for Xiamen second-hand homes with the Baidu API and finding the nearest metro station

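Latitude and longitude are key attributes of a second-hand home listing: with them we can compute the distance from each home to the nearest metro station, and rail transit is an important factor in housing prices. On Beike (贝壳), the location shown on a listing's detail page is loaded by JavaScript, so it cannot be read directly from the page HTML. That leaves the Baidu API as the way to get it.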

1. Baidu Geocoding Service

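The Baidu geocoding service can turn a place name into coordinates (geocoding) or coordinates into a place name (reverse geocoding). We need the former.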

The description and documentation are on Baidu's geocoding service page (百度地理编码服务); the API is now at version 3.

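You have to register as a Baidu developer before you can call the API. I won't go through the registration here. Once registered, create an application; it can be verified in one of two ways:

  • sn (signature) verification
  • IP whitelist verification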

I use the former. The sn calculation example in Baidu's documentation is still written for Python 2, so here is a Python 3 version:

# Imports
import json
import hashlib
from urllib import parse
import requests

def get_geographic_location(location_name):
    """
        Wrapper around the Baidu geocoding service API
        :param location_name: str, full place name
        :return: dict, {'lng': 0.0000..., 'lat': 0.0000...} or None
    """
    # Your application's ak and sk
    _ak = ''
    _sk = ''

    base_url = 'http://api.map.baidu.com'
    # Query string for the v3 API
    query_str = '/geocoding/v3/?address=%s&output=json&ak=%s&callback=showLocation' % (location_name, _ak)
    # Percent-encode query_str; characters listed in safe are left as-is
    encoded_str = parse.quote(query_str, safe="/:=&?#+!$,;'@()*[]")
    # Append sk directly to the end
    raw_str = encoded_str + _sk
    # The sn value is the md5 of the quoted string
    sn = hashlib.md5(parse.quote_plus(raw_str).encode('utf-8')).hexdigest()
    # Append the sn parameter
    url = base_url + query_str + "&sn=" + sn
    # Send the request
    response = requests.request('get', url)

    # Sample response: 'showLocation&&showLocation({"status":0,"result":{"location":{"lng":118.14537195003412,"lat":24.492264725782975},"precise":1,"confidence":70,"comprehension":100,"level":"地产小区"}})'

    # Try to parse the response
    if response.status_code == 200:
        try:
            json_text = response.text.split(')')[0].split('(').pop()
            result = json.loads(json_text)['result']['location']
        except (KeyError, ValueError):
            # KeyError: the payload has no 'result' (e.g. an error status)
            # ValueError: the body was not valid JSON
            result = None
    else:
        result = None
    return result
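
The signing and parsing steps in the function above can be pulled out into two small pure functions, which makes them easy to check without a network call. This is only a sketch: `compute_sn` and `parse_jsonp` are names made up here, and the ak/sk values in the usage lines are placeholders, not real credentials.

```python
import hashlib
import json
from urllib import parse

# Characters left unescaped when quoting the query string (same set as in the function above)
SAFE_CHARS = "/:=&?#+!$,;'@()*[]"


def compute_sn(query_str, sk):
    """Return (encoded_query, sn) for a request path such as '/geocoding/v3/?address=...&ak=...'."""
    encoded = parse.quote(query_str, safe=SAFE_CHARS)
    raw = encoded + sk
    sn = hashlib.md5(parse.quote_plus(raw).encode('utf-8')).hexdigest()
    return encoded, sn


def parse_jsonp(text):
    """Extract the JSON payload from a JSONP response like 'cb&&cb({...})'."""
    start = text.index('(') + 1
    end = text.rindex(')')
    return json.loads(text[start:end])


# Placeholder credentials, for illustration only
encoded, sn = compute_sn('/geocoding/v3/?address=厦门市&output=json&ak=demo_ak', 'demo_sk')
print(encoded, sn)

sample = 'showLocation&&showLocation({"status":0,"result":{"location":{"lng":118.14537195003412,"lat":24.492264725782975}}})'
print(parse_jsonp(sample)['result']['location'])
```

Slicing between the first `(` and the last `)` keeps the parser working even if the payload itself ever contains a parenthesis.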

2. Preparation

Read the data from the CSV file and add a column to hold the place name:

# Imports
import pandas as pd

# Read the data from CSV
data_clean = pd.read_csv('data_clean.csv', index_col=0)

# Add the location_name column
data_clean['location_name'] = '厦门市' + data_clean['district'] + '区' + data_clean['sub_district'] + data_clean['resblock']

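Several homes may be in the same residential community, so many place names repeat. De-duplicate them first to avoid wasting API calls: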

# Export the full place names (community names) to a locations table, with empty 'lng' and 'lat' columns to hold the coordinates
locations = pd.DataFrame({'location_name': '厦门市' + data_clean['district'] + '区' + data_clean['sub_district'] + data_clean['resblock'],
                          'lng': 0.0,
                          'lat': 0.0,
                          })
# Drop duplicates
locations = locations.drop_duplicates()
# locations.to_csv('locations.csv')
# Rebuild the index
locations = locations.reset_index(drop=True)
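
To see what the de-duplication saves, here is a toy run on invented rows (the community names 小区A and 小区B are made up). It assumes pandas is installed, as in the rest of the article.

```python
import pandas as pd

# Invented listings: three homes, two of them in the same community
listings = pd.DataFrame({
    'district': ['思明', '思明', '湖里'],
    'sub_district': ['莲前', '莲前', '金山'],
    'resblock': ['小区A', '小区A', '小区B'],
})
listings['location_name'] = ('厦门市' + listings['district'] + '区'
                             + listings['sub_district'] + listings['resblock'])

# One row per distinct place name, with a clean 0..n-1 index
unique_locations = (listings[['location_name']]
                    .drop_duplicates()
                    .reset_index(drop=True))
unique_locations['lng'] = 0.0
unique_locations['lat'] = 0.0

print(len(listings), 'listings ->', len(unique_locations), 'API calls')
```

Resetting the index matters because the lookup loop addresses rows by position with `range(0, len(...))`.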

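You also need a table of metro and BRT stations, stations.xlsx (the code further down loads it as stations.csv). Here is a sample of the Xiamen data; leave a comment if you need the file: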

location_name type line district lng lat
0 第一码头 BRT站 快1线 思明 0.00000000 0.00000000
1 开禾路口 BRT站 快1线 思明 0.00000000 0.00000000
2 思北 BRT站 快1线 思明 0.00000000 0.00000000
3 斗西路 BRT站 快1线 思明 0.00000000 0.00000000
4 二市 BRT站 快1线 思明 0.00000000 0.00000000

3. Calling the API to get the coordinates

Get the coordinates for the homes:

import time

for i in range(0, len(locations)):
    if locations.loc[i]['lng'] == 0 or locations.loc[i]['lat'] == 0:
        location_name = locations.loc[i]['location_name']
        lng_and_lat = get_geographic_location(location_name)
        if lng_and_lat:
            locations.at[i, 'lng'] = float(lng_and_lat['lng'])
            locations.at[i, 'lat'] = float(lng_and_lat['lat'])
        # Throttle the request rate
        time.sleep(0.1)
# Save for later use (the nearest-station step reads this file back)
locations.to_csv('locations.csv')
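
The loop only queries rows whose coordinates are still zero, so it can be re-run after a failure without repeating finished rows. Here is a dependency-free sketch of that pattern. `fill_missing` and `fake_geocode` are hypothetical names, and the fake function stands in for the real API call.

```python
import time


def fill_missing(rows, geocode, delay=0.1):
    """Fill 'lng'/'lat' for rows still at 0.0; return how many lookups were made."""
    calls = 0
    for row in rows:
        if row['lng'] != 0 and row['lat'] != 0:
            continue  # already geocoded on an earlier run
        result = geocode(row['location_name'])
        calls += 1
        if result:
            row['lng'] = float(result['lng'])
            row['lat'] = float(result['lat'])
        time.sleep(delay)  # throttle every request, successful or not
    return calls


def fake_geocode(name):
    """Stand-in for the real API call: knows one place, returns None otherwise."""
    known = {'小区A': {'lng': 118.1, 'lat': 24.5}}
    return known.get(name)


rows = [
    {'location_name': '小区A', 'lng': 0.0, 'lat': 0.0},
    {'location_name': '小区B', 'lng': 0.0, 'lat': 0.0},
    {'location_name': '小区C', 'lng': 118.2, 'lat': 24.6},
]
first = fill_missing(rows, fake_geocode, delay=0)
second = fill_missing(rows, fake_geocode, delay=0)
print(first, second)
```

On the second pass only the row that failed the first time is retried, which is what makes saving the table between runs worthwhile.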

Get the coordinates for Xiamen's metro and BRT stations:

stations = pd.read_csv("stations.csv", index_col=0)
stations[['lng', 'lat']] = stations[['lng', 'lat']].astype("float64")

for i in range(0, len(stations)):
    if stations.loc[i]['lng'] == 0 or stations.loc[i]['lat'] == 0:
        location_name = '厦门市' + stations.loc[i]['location_name'] + stations.loc[i]['type']
        lng_and_lat = get_geographic_location(location_name)
        if lng_and_lat:
            stations.at[i, 'lng'] = float(lng_and_lat['lng'])
            stations.at[i, 'lat'] = float(lng_and_lat['lat'])
        # Throttle the request rate
        time.sleep(0.1)

# Save for later use
stations.to_csv('stations.csv')
# Take a look
print(stations[0: 5])

Print the first rows to check; the coordinates are there:

location_name type line district lng lat
0 第一码头 BRT站 快1线 思明 118.077979 24.466364
1 开禾路口 BRT站 快1线 思明 118.081101 24.464567
2 思北 BRT站 快1线 思明 118.083741 24.461627
3 斗西路 BRT站 快1线 思明 118.091966 24.471916
4 二市 BRT站 快1线 思明 118.096435 24.485407

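4. Finding the nearest metro/BRT station for each home

With coordinates in hand, we can compute the straight-line distance from each home to the stations: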

# Library for computing the straight-line (great-circle) distance from coordinates
from geopy.distance import great_circle

# De-duplicate stations so each interchange station appears only once
stations = pd.read_csv("stations.csv", index_col=0)
stations = stations.drop_duplicates(subset=['location_name', 'type'], keep='first')

# Combine lat/lng in the locations and stations tables into (lat, lng) points
locations = pd.read_csv('locations.csv', index_col=0)
def point(x, y):
    return (x, y)
locations['point'] = locations.apply(lambda row: point(row['lat'], row['lng']), axis=1)
stations['point'] = stations.apply(lambda row: point(row['lat'], row['lng']), axis=1)

# Walk through locations, find the nearest metro or BRT station for each community, and compute the distance
# Rebuild the indexes and add the new columns first
stations.reset_index(drop=True, inplace=True)
locations.reset_index(drop=True, inplace=True)
locations['nearest_station'] = ''
locations['distance_to_station'] = 0

for i in range(0, len(locations)):
    nearest_station = ''
    distance_to_station = 9999999999999
    # The community's point does not change inside the inner loop
    point_resblock = locations.loc[i]['point']
    for j in range(0, len(stations)):
        point_station = stations.loc[j]['point']
        distance = great_circle(point_resblock, point_station).m
        if distance < distance_to_station:
            distance_to_station = distance
            nearest_station = stations.loc[j]['location_name'] + stations.loc[j]['type']
    # Write the nearest station and the distance (whole meters, so the column stays integer) to locations
    locations.at[i, 'nearest_station'] = nearest_station
    locations.at[i, 'distance_to_station'] = int(round(distance_to_station))
# Save to locations.csv for later use
locations.to_csv('locations.csv')
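
For readers who want to see what the distance step amounts to, here is a dependency-free sketch using the haversine formula on a spherical Earth. It is not the geopy implementation used above: `haversine_m` and `nearest` are hypothetical helpers, the Earth radius is an assumed mean value, and the `home` point is invented. The station coordinates are copied from the table printed earlier.

```python
import math

EARTH_RADIUS_M = 6371009  # assumed mean Earth radius in meters (spherical model)


def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lng) points given in degrees."""
    lat1, lng1 = map(math.radians, p1)
    lat2, lng2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest(point, stations):
    """Return (name, distance_m) of the closest station; stations maps name -> (lat, lng)."""
    name = min(stations, key=lambda s: haversine_m(point, stations[s]))
    return name, haversine_m(point, stations[name])


# Coordinates copied from the station table printed earlier
stations = {
    '第一码头BRT站': (24.466364, 118.077979),
    '开禾路口BRT站': (24.464567, 118.081101),
    '思北BRT站': (24.461627, 118.083741),
}
home = (24.4650, 118.0805)  # invented point, for illustration only
name, dist = nearest(home, stations)
print(name, round(dist))
```

Note the (lat, lng) order of each point, which matches the order used when building the `point` column above.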

5. Joining the tables

Left outer join data_clean with locations, so that the four columns lng, lat, nearest_station and distance_to_station are added to data_clean:

# Left outer join of data_clean and locations; the result carries the coordinates and station data
data_clean_with_stations = pd.merge(data_clean, locations, how='left', left_on='location_name', right_on='location_name')
data_clean_with_stations.to_csv('data_clean_with_stations.csv')
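
A toy version of the join, on invented rows, shows how each home picks up the station data of its community. The `validate='many_to_one'` argument is an extra safety check that is not in the original code: pandas raises an error if a place name appears more than once on the right-hand side, which would otherwise silently duplicate homes.

```python
import pandas as pd

# Invented data: three homes in two communities
homes = pd.DataFrame({
    '_id': [1, 2, 3],
    'location_name': ['小区A', '小区A', '小区B'],
})
places = pd.DataFrame({
    'location_name': ['小区A', '小区B'],
    'nearest_station': ['站点X', '站点Y'],
    'distance_to_station': [350, 1200],
})

# Many homes may map to one place, but each place must be unique on the right
merged = pd.merge(homes, places, how='left', on='location_name', validate='many_to_one')
print(merged)
```

With a left join the row count of the homes table is preserved, so comparing `len(merged)` with `len(homes)` is a quick sanity check.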

Now the homes table is complete. Let's see which columns it has:

data_clean_with_stations.info()
Int64Index: 17298 entries, 0 to 17297
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 17298 non-null int64
1 area 17298 non-null float64
2 building_floors 17294 non-null float64
3 building_structure 16814 non-null object
4 building_type 16359 non-null object
5 direction 17298 non-null object
6 district 17298 non-null object
7 elevators 16856 non-null float64
8 floors 17294 non-null object
9 house_per_floor 16856 non-null float64
10 house_structure 14887 non-null object
11 mortgage 14567 non-null float64
12 ownership 17298 non-null object
13 price 17298 non-null float64
14 property_right_ownership 17298 non-null object
15 remodel 17298 non-null object
16 resblock 17298 non-null object
17 rooms 17292 non-null object
18 sub_district 17298 non-null object
19 time_last_transaction 14752 non-null object
20 time_on_sold 17298 non-null object
21 unit_price 17298 non-null float64
22 usage 17298 non-null object
23 with_elevators 17298 non-null bool
24 year_build 17186 non-null float64
25 location_name 17298 non-null object
26 lng 17298 non-null float64
27 lat 17298 non-null float64
28 nearest_station 17298 non-null object
29 distance_to_station 17298 non-null int64
dtypes: bool(1), float64(10), int64(2), object(17)
memory usage: 4.0+ MB

At last we can start analyzing the Xiamen second-hand housing data. Some ideas so far:

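# Count homes by year built and their share of floor area: does this give a glimpse of how Xiamen's urbanization has progressed and where it is heading?
# Compare the share of bare-shell and fully decorated homes, which hints at whether buyers are mostly investors or owner-occupiers
# Relate the time of the last transaction to price, and compare homes held for over 5 years with those held for over 2 years (满5/满2)
# How much prices differ between high, middle and low floors, with and without elevators
# Relate asking price and remaining years of property rights to the age of the building
# Building height by construction era
# Mortgage ratio by floor area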

The analysis starts here: 简单分析贝壳在售厦门二手房数据(一) (A simple analysis of Xiamen second-hand homes listed on Beike, part 1)

Original article by 10bests; reproduction in any form is prohibited: https://www.10bests.com/baidu-geocoder-api-sn/
