Alternative Data for Finance: Categories and Use Cases¶
The Alternative Data Revolution¶
- The 5 Vs of big data: volume, velocity, variety, veracity, value
- Use cases for the new data sources
  - Online price data for a representative basket of goods and services can be used to measure inflation
  - Store visit or purchase counts permit real-time estimates of company- or industry-specific sales and economic activity
  - Satellite images capture crop yields or activity at mines and oil rigs before that information is available elsewhere
- New opportunities to capture the traditional investment factors
  - Momentum: ML can identify asset exposure to market price movements, industry sentiment, and economic factors
  - Value: algorithms can analyze large amounts of economic and industry-specific data beyond financial statements to predict a company's intrinsic value
  - Quality: sophisticated analysis of integrated data such as customer ratings, employee reviews, e-commerce, and app traffic can identify gains in market share or other underlying earnings-quality factors
  - Sentiment: by processing and interpreting news and social media content in real time, ML algorithms can quickly detect emerging sentiment and synthesize information from diverse sources into a more coherent big picture
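As a rough illustration of the momentum factor mentioned above — this is a sketch on synthetic data, not the book's code, and the tickers and parameters are made up — a classic 12-month-minus-1-month momentum signal can be computed with pandas:

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for three hypothetical tickers (~2 years of business days)
rng = np.random.default_rng(42)
dates = pd.bdate_range('2020-01-01', periods=504)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, (504, 3)), axis=0)),
    index=dates, columns=['AAA', 'BBB', 'CCC'])

# 12-1 momentum: return over the past 252 trading days,
# skipping the most recent 21 days to avoid short-term reversal
momentum = prices.shift(21).pct_change(252 - 21)
signal = momentum.iloc[-1].rank(ascending=False)  # 1 = strongest momentum
print(signal)
```

Alternative data would feed such a pipeline by adding factor inputs beyond prices (sentiment scores, foot traffic, and so on) to the cross-sectional ranking.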
Sources of Alternative Data¶
- Individuals: data from individuals who post on social media, review products, or use search engines
- Businesses: data that records commercial transactions (notably credit card payments) or captures supply-chain activity as an intermediary
- Sensors: sensor data that captures economic activity through, among many other things, satellite images, security-camera footage, or people's movement patterns from cell-phone towers
- Individuals
  - Individuals generate electronic data through their online activity
- Business processes
  - Data generated by business processes is more structured than data generated by individuals. Credit card transactions and company-generated data such as point-of-sale (POS) data are among the most reliable and predictive data sets
- Sensors
  - Networked sensors embedded in a wide variety of devices
  - Satellites
    - Monitor economic activity that can be captured with aerial coverage, such as agriculture, mineral production, shipping, commercial or residential construction, and shipbuilding
  - Geolocation data
    - Can be used to measure the impact of marketing campaigns and to estimate foot traffic or sales
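The geolocation use case above can be sketched with pandas — the device IDs, store IDs, and timestamps here are hypothetical, and the example assumes pings have already been geofenced to store locations:

```python
import pandas as pd

# Hypothetical device pings already mapped to store locations
pings = pd.DataFrame({
    'device_id': ['d1', 'd1', 'd2', 'd3', 'd3', 'd3'],
    'store_id':  ['S01', 'S01', 'S01', 'S02', 'S02', 'S01'],
    'timestamp': pd.to_datetime([
        '2023-06-01 10:00', '2023-06-01 10:05',  # two pings, one visit (d1)
        '2023-06-01 11:00',
        '2023-06-01 09:30', '2023-06-02 09:30', '2023-06-02 12:00']),
})

# Collapse repeated pings into daily unique-device counts per store,
# a crude proxy for foot traffic
visits = (pings.assign(date=pings.timestamp.dt.date)
               .groupby(['store_id', 'date'])['device_id']
               .nunique()
               .rename('visitors'))
print(visits)
```

A real pipeline would first map raw latitude/longitude pings to store polygons and deduplicate dwell times; this sketch only shows the final aggregation step.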
Criteria for Evaluating Alternative Data¶
- The ultimate objective of alternative data
  - To provide an information advantage in the competitive search for trading signals that generate alpha (positive, uncorrelated investment returns)
- Alternative data sets can be evaluated on the quality of the signal content, the quality of the data itself, and various technical aspects.
- Quality of the signal content
  - Asset classes
  - Investment style
  - Risk premiums
  - Alpha content and quality
- Quality of the data
  - Legal and reputational risks
  - Exclusivity
  - Time horizon
  - Frequency
  - Reliability
- Technical aspects
  - Latency
  - Format
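The criteria above could be folded into a rough screening score when comparing candidate data sets. The weights and ratings below are arbitrary illustrations, not from the book:

```python
# Arbitrary 1-5 ratings for a hypothetical data set against the criteria above
ratings = {
    'signal_quality': 4,  # asset class fit, investment style, alpha content
    'data_quality': 3,    # exclusivity, history, frequency, reliability
    'legal_risk': 2,      # legal/reputational exposure, treated as a rating here
    'technical': 5,       # latency, format
}
weights = {'signal_quality': 0.4, 'data_quality': 0.3,
           'legal_risk': 0.2, 'technical': 0.1}

score = sum(ratings[k] * weights[k] for k in ratings)
print(f'weighted score: {score:.2f} / 5')
```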
Working with Alternative Data¶
- Collect alternative data using web scraping
Scraping OpenTable Data¶
In [ ]:
!pwd
/content
In [ ]:
!wget https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz
In [ ]:
!tar -xvzf /content/geckodriver-*
geckodriver
In [ ]:
!chmod +x geckodriver
!sudo mv geckodriver /usr/local/bin/
In [ ]:
!pip install selenium
!apt-get update
!apt install firefox
!apt install -y firefox-geckodriver
In [ ]:
!pip install webdriver_manager
In [ ]:
# coding: utf-8
import re
from time import sleep

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options


# The site's class names are obfuscated/auto-generated, so the selectors below
# were found by manually inspecting the page and may break when the markup changes
def parse_html(page_source):
    """Parse content from various tags from OpenTable restaurants listing"""
    data = []
    soup = BeautifulSoup(page_source, 'html.parser')
    for resto in soup.find_all('div', class_='rest-row-info'):
        item = {}
        item['name'] = resto.find('span', class_='rest-row-name-text').text
        booking = resto.find('div', class_='booking')
        item['bookings'] = re.search(r'\d+', booking.text).group() if booking else 'NA'
        rating = resto.find('div', class_='star-rating-score')
        item['rating'] = float(rating['aria-label'].split()[0]) if rating else 'NA'
        reviews = resto.find('span', class_='underline-hover')
        item['reviews'] = int(re.search(r'\d+', reviews.text).group()) if reviews else 'NA'
        pricing = resto.find('div', class_='rest-row-pricing')
        # count '$' signs in the pricing text; find_all('i').count('$') compared
        # Tag objects against a string and always returned 0
        item['price'] = pricing.text.count('$') if pricing else 0
        cuisine = resto.find('span', class_='rest-row-meta--cuisine rest-row-meta-text sfx1388addContent')
        item['cuisine'] = cuisine.text.strip() if cuisine else ''
        location = resto.find('span', class_='rest-row-meta--location rest-row-meta-text sfx1388addContent')
        item['location'] = location.text.strip() if location else ''
        data.append(item)
    return pd.DataFrame(data)


options = Options()
options.add_argument('--headless')  # do not show a browser window

# Start selenium and click through pages until we reach the end,
# storing results by iteratively appending to a csv file
driver = webdriver.Firefox(options=options)
url = 'https://www.opentable.com/new-york-restaurant-listings'
driver.get(url)

page = collected = 0
while True:
    sleep(1)
    new_data = parse_html(driver.page_source)
    print(new_data)
    if new_data.empty:
        break
    if page == 0:
        new_data.to_csv('results.csv', index=False)
    else:
        new_data.to_csv('results.csv', index=False, header=None, mode='a')
    page += 1
    collected += len(new_data)
    print(f'Page: {page} | Downloaded: {collected}')
    try:
        # find_element_by_link_text was removed in Selenium 4; use By.LINK_TEXT
        driver.find_element(By.LINK_TEXT, 'Next').click()
    except NoSuchElementException:
        break
driver.close()

restaurants = pd.read_csv('results.csv')
print(restaurants)
Empty DataFrame
Columns: []
Index: []
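The empty DataFrame above shows that the live site no longer serves the markup the parser expects. The parsing logic itself can still be exercised offline against a hand-written HTML snippet that mimics the assumed `rest-row-info` structure (the restaurant name and numbers below are invented):

```python
import re
from bs4 import BeautifulSoup

# Minimal snippet mimicking the class names the OpenTable parser looks for
html = """
<div class="rest-row-info">
  <span class="rest-row-name-text">Test Bistro</span>
  <div class="booking">Booked 27 times today</div>
  <div class="star-rating-score" aria-label="4.5 stars"></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('div', class_='rest-row-info')

name = row.find('span', class_='rest-row-name-text').text
bookings = int(re.search(r'\d+', row.find('div', class_='booking').text).group())
rating = float(row.find('div', class_='star-rating-score')['aria-label'].split()[0])
print(name, bookings, rating)
```

Testing selectors against a fixed snippet like this separates parser bugs from site changes, which is useful when a scraper silently starts returning nothing.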
Scraping and Parsing Earnings Call Transcripts¶
In [ ]:
!pip install furl
In [ ]:
!mkdir parsed
In [ ]:
import re
from pathlib import Path
from random import random
from time import sleep
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup
from furl import furl
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

transcript_path = Path('/content/parsed')  # manually specified output path


def store_result(meta, participants, content):
    """Save parsed content to csv"""
    # use pathlib throughout: the original built a str here, which has no .exists()
    path = transcript_path / meta['symbol']
    path.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(content, columns=['speaker', 'q&a', 'content']).to_csv(path / 'content.csv', index=False)
    pd.DataFrame(participants, columns=['type', 'name']).to_csv(path / 'participants.csv', index=False)
    pd.Series(meta).to_csv(path / 'earnings.csv')


# Class names are obfuscated, so the tags below were found by manually inspecting the page
def parse_html(html):
    """Main html parser function"""
    date_pattern = re.compile(r'(\d{2})-(\d{2})-(\d{2})')
    quarter_pattern = re.compile(r'(\bQ\d\b)')
    soup = BeautifulSoup(html, 'lxml')
    meta, participants, content = {}, [], []
    h1 = soup.find('h1', itemprop='headline')
    if h1 is None:
        return
    h1 = h1.text
    meta['company'] = h1[:h1.find('(')].strip()
    meta['symbol'] = h1[h1.find('(') + 1:h1.find(')')]
    title = soup.find('div', class_='title')
    if title is None:
        return
    title = title.text
    print(title)
    match = date_pattern.search(title)
    if match:
        m, d, y = match.groups()
        meta['month'] = int(m)
        meta['day'] = int(d)
        meta['year'] = int(y)
    match = quarter_pattern.search(title)
    if match:
        meta['quarter'] = match.group(0)
    qa = 0
    speaker_types = ['Executives', 'Analysts']
    for header in [p.parent for p in soup.find_all('strong')]:
        text = header.text.strip()
        if text.lower().startswith('copyright'):
            continue
        elif text.lower().startswith('question-and'):
            qa = 1
            continue
        elif any([type in text for type in speaker_types]):
            for participant in header.find_next_siblings('p'):
                if participant.find('strong'):
                    break
                else:
                    participants.append([text, participant.text])
        else:
            p = []
            for participant in header.find_next_siblings('p'):
                if participant.find('strong'):
                    break
                else:
                    p.append(participant.text)
            content.append([header.text, qa, '\n'.join(p)])
    return meta, participants, content


SA_URL = 'https://seekingalpha.com/'
TRANSCRIPT = re.compile('Earnings Call Transcript')

options = Options()
options.add_argument('--headless')  # do not show a browser window

# Start selenium and click through listing pages until we reach the end,
# storing each parsed transcript as csv files
driver = webdriver.Firefox(options=options)
next_page = True
page = 1
while next_page:
    print(f'Page: {page}')
    url = urljoin(SA_URL, f'earnings/earnings-call-transcripts/{page}')
    driver.get(url)
    sleep(8 + (random() - .5) * 2)
    response = driver.page_source
    page += 1
    soup = BeautifulSoup(response, 'lxml')
    links = soup.find_all(name='a', string=TRANSCRIPT)
    if len(links) == 0:
        next_page = False
    else:
        for link in links:
            transcript_url = link.attrs.get('href')
            article_url = furl(urljoin(SA_URL, transcript_url)).add({'part': 'single'})
            driver.get(article_url.url)
            html = driver.page_source
            result = parse_html(html)
            if result is not None:
                meta, participants, content = result
                meta['link'] = transcript_url  # store the URL, not the bs4 Tag object
                store_result(meta, participants, content)
            sleep(8 + (random() - .5) * 2)
driver.close()
Page: 1
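The date and quarter regexes in parse_html can be sanity-checked in isolation on a title string in the assumed Seeking Alpha layout (the company, ticker, and date below are fabricated):

```python
import re

date_pattern = re.compile(r'(\d{2})-(\d{2})-(\d{2})')
quarter_pattern = re.compile(r'(\bQ\d\b)')

# Hypothetical transcript title in the format the parser expects
title = ('Acme Corp (ACME) CEO Jane Doe on Q1 2023 Results '
         '- Earnings Call Transcript 05-04-23')

m = date_pattern.search(title)
month, day, year = (int(g) for g in m.groups())
quarter = quarter_pattern.search(title).group(0)
print(month, day, year, quarter)
```

Testing the patterns this way catches format changes (for example, a switch to four-digit years) without running the whole Selenium pipeline.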