Alternative Data Trading using Python, Google Trends, NYT Articles, and Cohere
This article is co-authored by my co-creator Alex Zhu.
Description
You can find the GitHub repo here.
The B2 stock trader is an application that uses Google Trends and Cohere’s sentiment analysis on NYT articles to pick a portfolio of stocks. Its algorithmic strategy is to find undervalued (low Google interest) but highly rated (high NYT sentiment score) stocks that others overlook because most investors focus on popular but likely overvalued stocks. We were inspired by B2EMO, a ground mech salvage assist droid in Star Wars: Andor that looked through junkyards for valuable scrap. Similarly, we scrape the internet to find undervalued stocks.
We built this project to learn how to source and process data ourselves for alternative data trading, to improve our ability to work with libraries and APIs, and to practice using Git and debugging projects; we reached that goal. The goal was not necessarily to create a safe, profit-maximizing portfolio of the kind that would be useful in the real world. However, B2 was still highly successful in choosing profitable portfolios.
We tested starting from 2008, which is as far back as we could go to find a list of that year’s Nasdaq-100 stocks. We used the Nasdaq-100 stocks as a starting point, then since we were trading on the NYSE, we pre-processed the 100 stocks to approximately 80 that were traded on the NYSE and not just the NASDAQ. From there, we implemented our two strategies to pick around 10 stocks to form our portfolio:
NYTScraper — long, high-brand value companies:
High-brand-value companies build up a better reputation, which increases customer loyalty and employee retention. In the long run, these effects compound and give the company a large potential for growth.
In the NYTScraper, the pynytimes API is used to get headlines of articles from the New York Times, AP, Reuters and International Herald Tribune containing a stock’s name, then Cohere performs a sentiment analysis using custom training data to determine if the headlines are positive or negative. The scores are averaged out to generate a sentiment score.
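The averaging step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `sentiment_score`, the (label, confidence) input format, and the rule that a negative headline with confidence c counts as 1 - c are all assumptions, chosen so the result lands on a 0 to 1 scale.

```python
from statistics import mean


def sentiment_score(classifications):
    """Average headline classifications into one score in [0, 1].

    `classifications` is a list of (label, confidence) pairs, where label is
    "positive" or "negative". Assumption: a negative headline with
    confidence c contributes 1 - c, so 0.5 reads as neutral.
    """
    if not classifications:
        return None  # no headlines were found for this stock
    values = []
    for label, confidence in classifications:
        if label == "positive":
            values.append(confidence)
        elif label == "negative":
            values.append(1.0 - confidence)
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return mean(values)


# Hypothetical classifier output for three headlines about one stock.
headlines = [("positive", 0.9), ("negative", 0.8), ("positive", 0.7)]
score = sentiment_score(headlines)
```

Returning `None` for a stock with no headlines keeps "no data" distinct from a genuinely neutral score.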
GoogleScraper — long, low-attention companies:
While low-volatility strategies have been around for a long time, today’s improved technology offers better ways to measure low risk or low investor attention and so find undervalued companies. The premise of this strategy is that these stocks tend to be underpriced, and once attention turns back toward them, their prices tend to increase (Source). In addition, these stocks are usually associated with lower risk.
In the GoogleScraper, the pytrends API is used to get historical data from Google Trends of stock tickers. Stocks are compared, and those with low search volume receive a more favourable search score.
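One way to turn relative search volume into a score is min-max scaling, sketched below. The function name, the input format (a mean interest value per ticker), and the scaling itself are assumptions for illustration. In this convention a lower number is the more favourable one, which fits the "NYTScore minus SearchScore" calculation described later in the article.

```python
def search_scores(mean_interest):
    """Min-max scale mean search interest to [0, 1] per ticker.

    0 marks the least-searched stock and 1 the most-searched one, so a
    LOWER value is the more favourable one in this sketch.
    """
    low = min(mean_interest.values())
    high = max(mean_interest.values())
    if high == low:
        # Every stock has the same interest; nothing to distinguish.
        return {ticker: 0.0 for ticker in mean_interest}
    return {
        ticker: (value - low) / (high - low)
        for ticker, value in mean_interest.items()
    }


# Hypothetical tickers and interest values.
scores = search_scores({"AAA": 10, "BBB": 55, "CCC": 100})
```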
Since both our strategies rely on longer time frames, we chose start dates that were at least 10 years earlier than the current year (2023). Running the code on data from 2015 and then checking it in 2016 performed drastically worse, about half as well as the S&P 500. However, tests on time frames longer than 10 years consistently beat the S&P 500 index.
i = 0
# take top 10 stocks and stop if scores are negative (avoiding loss)
while i < 10 and max(localFinalScores) > 0:
    best = localFinalScores.index(max(localFinalScores))
    indexOfBestStocks.append(best)
    # overwrite the picked score so it is not selected again
    localFinalScores[best] = -1
    i += 1
Once a final score that took the NYT and Google scores into account was generated for each stock, a portfolio was generated and backtested using the yfinance API. We ran further tests for 2010 and 2012, reaching profits of up to 633% and 648%, respectively.
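To make the profit figures concrete, here is a minimal sketch of how a buy-and-hold return could be computed from start and end prices. It assumes equal weighting and ignores dividends and fees, neither of which the article specifies. In practice the prices would come from yfinance; they are hard-coded, hypothetical values here so the example runs offline.

```python
def portfolio_return(start_prices, end_prices):
    """Percent return of an equal-weighted buy-and-hold portfolio.

    Assumption: the same amount of money goes into every stock on the
    start date and nothing is rebalanced afterwards.
    """
    returns = [
        end_prices[ticker] / start_prices[ticker] - 1.0
        for ticker in start_prices
    ]
    return 100.0 * sum(returns) / len(returns)


# Hypothetical prices: AAA triples (+200%), BBB gains 50%.
start = {"AAA": 10.0, "BBB": 20.0}
end = {"AAA": 30.0, "BBB": 30.0}
result = portfolio_return(start, end)  # average of +200% and +50%
```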
What makes our project special is that we did all of it at no cost. Finance and quant are largely gatekept; it costs money to gain access to historical stock data, SEO information, alternative data sources, financial advice, and more. We found creative ways to source the information we needed, such as using the Wayback Machine, albeit sometimes limited by API call limits and the volume of available information.
Challenges Faced
1. Avoiding Look-Ahead Bias
Look-ahead bias occurs when an analysis of a past date uses data that was not yet available on that date. For us, an example of look-ahead bias would be investing in today’s biggest companies in 2008, when they were still small. To avoid it, we made sure to choose stocks from companies that were already in the Nasdaq-100 in that year, not in the present day. This meant using the Wayback Machine to get old data. We also analyzed Google Trends and NYT articles only from before the investment date.
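The date cutoff can be expressed as a simple filter. The sketch below is illustrative only: the article records, their field names, and the function name are hypothetical, and the same idea applies to Google Trends rows.

```python
from datetime import date


def available_before(articles, investment_date):
    """Keep only articles published strictly before the investment date."""
    return [a for a in articles if a["published"] < investment_date]


# Hypothetical records; only the first existed when investing on 2008-01-01.
articles = [
    {"headline": "Old news", "published": date(2007, 11, 3)},
    {"headline": "Future news", "published": date(2009, 2, 14)},
]
usable = available_before(articles, date(2008, 1, 1))
```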
2. Pytrends API Limits
Pytrends, the API for getting Google Trends data, had three main issues to address:
Firstly, you could only compare the historical search volume of 5 search terms at once, and they would return scores relative to each other. This meant that to analyze, let’s say, 85 stocks, we would have to call the API 21 times. The first call would analyze 5 stocks relative to each other, then one of those stocks would become the normalizing stock. The next API calls would all include that normalizing stock and the next 4 stocks, making 20 more calls for the remaining 80 stocks. In the end, all the data would be normalized so there would be scores of all the stocks relative to each other.
For example, if apples to oranges was 2:1 and apples to grapes was 1:3, then using apples as the normalizing term gives apples to oranges to grapes as 2:1:6.
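The batching and rescaling described above can be sketched like this. The function names and the dictionary format for each batch result are assumptions; the real pytrends calls are left out so the example runs offline. Choosing the first ticker as the normalizing stock is also an assumption.

```python
def make_batches(tickers, batch_size=5):
    """Split tickers into batches that all share the first ticker as anchor."""
    anchor = tickers[0]
    batches = [tickers[:batch_size]]
    rest = tickers[batch_size:]
    step = batch_size - 1  # one slot in every later batch goes to the anchor
    for i in range(0, len(rest), step):
        batches.append([anchor] + rest[i:i + step])
    return batches


def chain_scores(batch_results, anchor):
    """Rescale each batch so the anchor matches its value in the first batch."""
    reference = batch_results[0][anchor]
    if reference == 0:
        raise ValueError("anchor has no search volume in the first batch")
    combined = {}
    for result in batch_results:
        if result[anchor] == 0:
            continue  # the anchor rounded to 0 here, so the batch cannot be rescaled
        factor = reference / result[anchor]
        for name, value in result.items():
            combined.setdefault(name, value * factor)
    return combined


# 85 placeholder tickers need 1 + 80 / 4 = 21 calls.
batches = make_batches([f"T{i}" for i in range(85)])

# The fruit example: apples is the normalizing term.
fruit = chain_scores(
    [{"apples": 2, "oranges": 1}, {"apples": 1, "grapes": 3}],
    anchor="apples",
)
```

Skipping a batch whose anchor value is 0 avoids a division by zero, at the cost of losing the other four stocks in that batch.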
The second big issue was a limited amount of API call volume, so we only analyzed a month of trends data at once to avoid errors.
The final issue is that there would often be no data available for specific search terms in irregular situations. We assume that this is an issue on the Google Trends side, as it could be largely avoided by polling a larger time frame or re-running the code for those search terms. One likely explanation stems from the fact that Google Trends search results are all shown relative to their peak popularity. If a company has a huge peak in interest for a short time, then other time frames where it has less interest get pushed much lower.
In the image above, $BIIB peaked very high in the middle of 2021, pushing its interest to 0 in May 2012 and other days (Google Trends also does not return decimals, so search volume often just rounded to 0). In our 2012 portfolio, a lot of stocks returned with 0, possibly because the normalizing stock was too popular at the time, pushing other stocks to 0 in all their timeframes.
3. Choosing an API for Sentiment Analysis
At first, we used tweepy to analyze tweets about a stock ticker. However, you had to pay for the Twitter API that let you look at tweets more than 30 days old, so because we had to avoid look-ahead bias, this was not an option. We switched to using pynytimes instead and analyzing the sentiment of news articles.
Next Steps
1. Fix Google Trends Data
Our stock picks could be much more informed with more Google Trends data. If we could poll a larger time frame and remove the cases where there was no data available, we would have a lot more stocks to pick from and more information on each of them.
2. Make the Sentiment Smarter
Cohere’s natural language processing often gives questionable sentiment classifications and confidence levels. One reason is that it doesn’t know who the subject is, the context of the situation, or whether the article is even relevant. Some examples of these problems are shown below for the company Biogen ($BIIB):
"Riding High, Biotech Firms Remain Wary" → Classification: "positive", Confidence: 0.87216884
"Kyle Bass Wields New Weapon in Challenging Drug Makers" → Classification: "positive", Confidence: 0.809953
"Your Friday Briefing" → Classification: "negative", Confidence: 0.8532918
Some ideas are to filter out certain types of articles that are likely unrelated to finance, give more training data, or give it more to analyze than just the headlines of the articles.
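The first idea, filtering out unrelated articles, could start as a simple keyword check before the headlines reach the classifier. The function name, the skip phrase, and the company terms below are all hypothetical choices for illustration, not a tested filter.

```python
def relevant_headlines(headlines, company_terms, skip_phrases=("briefing",)):
    """Drop headlines that match a skip phrase or never mention the company."""
    kept = []
    for headline in headlines:
        text = headline.lower()
        if any(phrase in text for phrase in skip_phrases):
            continue  # generic round-up articles, e.g. daily briefings
        if any(term.lower() in text for term in company_terms):
            kept.append(headline)
    return kept


headlines = [
    "Riding High, Biotech Firms Remain Wary",
    "Kyle Bass Wields New Weapon in Challenging Drug Makers",
    "Your Friday Briefing",
]
kept = relevant_headlines(headlines, company_terms=["biogen", "biotech"])
```

A check this strict also drops the second headline, which may or may not be about the company, so it trades recall for relevance.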
3. Market Correlation and Risks
On an honest note, even though our project performed well over a couple of years, this is not financial advice and you should not invest real money in it. Our model has a much higher risk than most good models. Using the performance summary of 2012, our model’s worst year was down 41.26% while the S&P 500’s worst year was down 18.23%. Hopefully, after reading through all these problems you are more convinced not to put money into it.
4. Getting More Data
Using pytrends, we can actually get related terms to the company name being searched. We could check articles containing those related terms for more articles to perform sentiment analysis on. We could also analyze if the current trend of search volume is going up or down to predict if the stock is likely to increase or decrease in value in the future. Finally, we could use the Twitter API, filter out bots, and have a third data source to consider.
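Whether search volume is trending up or down could be estimated with a least-squares slope over the interest values, as in this sketch. The function name is hypothetical, the observations are assumed to be evenly spaced in time, and nothing here shows that the slope actually predicts stock movement.

```python
def trend_slope(values):
    """Least-squares slope of evenly spaced observations.

    A positive result means interest is rising over the window,
    a negative one means it is falling.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


rising = trend_slope([1, 2, 3, 4])
falling = trend_slope([40, 30, 20, 10])
```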
5. Using Machine Learning
At this point in the project, one of the least optimized sections is how we choose our stocks, since it is simply a calculation of NYTScore minus SearchScore. A lot of our top stocks are being picked simply because they have the highest NYTScore. Since those are normalized on a scale of 0 to 1, the high scores are all nearly guaranteed to be included in our final portfolio.
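As a rough sketch of that calculation, the snippet below subtracts the two scores and keeps the best positive results, mirroring the top-10 selection loop shown earlier. The function name, the dictionary inputs, and the tickers and values are hypothetical.

```python
def pick_portfolio(nyt_scores, search_scores, size=10):
    """Rank stocks by NYTScore minus SearchScore and keep the positive top `size`."""
    final = {
        ticker: nyt_scores[ticker] - search_scores[ticker]
        for ticker in nyt_scores
        if ticker in search_scores  # skip stocks missing Google Trends data
    }
    ranked = sorted(final, key=final.get, reverse=True)
    return [ticker for ticker in ranked[:size] if final[ticker] > 0]


nyt = {"AAA": 0.9, "BBB": 0.6, "CCC": 0.4}
search = {"AAA": 0.2, "BBB": 0.7, "CCC": 0.1}
picks = pick_portfolio(nyt, search)
```

Here BBB is excluded because its final score is negative, even though its NYT score is higher than CCC's.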
Using different models, we could potentially find a better stock-picking algorithm. We could try a machine learning model, as it is almost certain that high brand value and low attention are not equally important for maximizing profit. In fact, it is even possible that our strategy itself is wrong, and Google search popularity is not inversely correlated with long-term stock growth at all. Some common starting points for models are multilayer perceptrons, decision trees, random forests, or support vector machines.
Credits
This project was co-created by Alex Zhu and Ryan Shen, with research and use of:
- https://quantpedia.com/six-examples-of-trading-strategies-that-use-alternative-data/
- https://medium.com/swlh/build-an-ai-stock-trading-bot-for-free-4a46bec2a18
- https://web.archive.org/web/20160402172246/http://siblisresearch.com/data/historical-components-nasdaq
- https://medium.com/analytics-vidhya/compare-more-than-5-keywords-in-google-trends-search-using-pytrends-3462d6b5ad62
- https://www.portfoliovisualizer.com/backtest-portfolio#analysisResults