Web Scraping using BeautifulSoup
Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file or to a database in tabular (spreadsheet-like) format.
# Loading libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
Here is the link to the NY Times article on Trump's lies: https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html
Here is the structure of the first lie:
Jan. 21 “I wasn’t a fan of Iraq. I didn’t want to go into Iraq.” (He was for an invasion before he was against it.)
link = "https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html"
response = requests.get(link)
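The request above assumes the server answers promptly and successfully. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and raises on HTTP error codes. The name `fetch_html` and its `timeout` and `session` parameters are hypothetical and not part of the original notebook; `requests.get(..., timeout=...)` and `Response.raise_for_status()` are standard `requests` calls.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `session` is an optional object with a `get` method (for example a
    requests.Session); it defaults to the requests module itself.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(link)` would take the place of `requests.get(link).text`. Whether the page still serves the same markup is not something this sketch can guarantee.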
Collecting all the records
soup = BeautifulSoup(response.text, "html.parser")
results = soup.find_all(name = "span", attrs = {"class" : "short-desc"})
len(results)
180
results[0]
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>
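To see why `find_all` returns one element per record, it may help to run it on a small inline snippet. The HTML below is a made-up sample that only mimics the structure shown above (the claims and example.com URLs are placeholders). The sketch also shows two equivalent spellings of the same query, the `class_` keyword and a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample that mimics the article's markup.
sample_html = """
<div>
  <span class="short-desc"><strong>Jan. 21 </strong>“First claim.” <span class="short-truth"><a href="https://example.com/a">(First rebuttal.)</a></span></span>
  <span class="short-desc"><strong>Jan. 23 </strong>“Second claim.” <span class="short-truth"><a href="https://example.com/b">(Second rebuttal.)</a></span></span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ is BeautifulSoup's keyword shortcut for attrs={"class": ...}.
by_attrs = sample_soup.find_all("span", attrs={"class": "short-desc"})
by_keyword = sample_soup.find_all("span", class_="short-desc")
by_selector = sample_soup.select("span.short-desc")

# The nested short-truth spans have a different class, so they are not matched.
print(len(by_attrs), len(by_keyword), len(by_selector))
```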
Parse the first lie into 4 structured columns:
- Date
- Lie
- Explanation
- Link
Looking at the first result to see the pattern of the HTML
r = results[0]
r
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>
Parsing the date by looking at the strong tag, dropping the trailing space, and appending the year
date = r.find("strong").text[:-1] + ", 2017"
date
'Jan. 21, 2017'
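The slice `[:-1]` relies on the date text ending with exactly one extra character. A slightly more tolerant sketch, assuming the same markup, strips whitespace with `get_text(strip=True)` instead. The function name `parse_date` and its `year` parameter are hypothetical; the year 2017 is the same assumption the original code makes.

```python
from bs4 import BeautifulSoup


def parse_date(record, year=2017):
    """Return a string such as 'Jan. 21, 2017' from the record's <strong> tag."""
    day = record.find("strong").get_text(strip=True)
    return f"{day}, {year}"


# Made-up record with the same shape as the article's markup.
sample = BeautifulSoup(
    '<span class="short-desc"><strong>Jan. 21 </strong>“Quote.” </span>',
    "html.parser",
).find("span")

print(parse_date(sample))
```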
Parsing the lie from the text node that follows the strong tag, dropping the surrounding quotation marks and the trailing space
lie = r.contents[1][1:-2]
lie
"I wasn't a fan of Iraq. I didn't want to go into Iraq."
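The slice `[1:-2]` assumes one opening quote at the start and a closing quote plus one space at the end. As a hedged alternative, the sketch below strips whitespace first and then any curly quotes, so it does not depend on the exact character count. `parse_lie` is a hypothetical helper name, and the sample record is shortened for illustration.

```python
from bs4 import BeautifulSoup


def parse_lie(record):
    """Return the quoted statement without surrounding whitespace or curly quotes."""
    raw = record.contents[1]  # the text node that follows <strong>
    return str(raw).strip().strip("“”")


# Made-up record with the same shape as the article's markup.
sample = BeautifulSoup(
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    "“I wasn't a fan of Iraq.” "
    '<span class="short-truth"><a href="#">(Rebuttal.)</a></span></span>',
    "html.parser",
).find("span")

print(parse_lie(sample))
```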
Parsing the explanation from the text of the a tag, dropping the surrounding parentheses
explanation = r.find("a").text[1:-1]
explanation
'He was for an invasion before he was against it.'
Parsing the link from the href attribute of the a tag
link = r.find("a")["href"]
link
'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
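Both the explanation and the link come from the same a tag, and `r.find("a")` returns `None` when a record has no link, which would make the lookups above fail. The sketch below, with the hypothetical helper name `parse_rebuttal`, reads both values in one place and returns `None` for each when the tag is missing. The sample snippets are invented for illustration.

```python
from bs4 import BeautifulSoup


def parse_rebuttal(record):
    """Return (explanation, url); both are None when the record has no <a> tag."""
    anchor = record.find("a")
    if anchor is None:
        return None, None
    explanation = anchor.get_text(strip=True)
    # Drop the wrapping parentheses only when both are present.
    if explanation.startswith("(") and explanation.endswith(")"):
        explanation = explanation[1:-1]
    # .get returns None instead of raising KeyError if href is absent.
    return explanation, anchor.get("href")


with_link = BeautifulSoup(
    '<span><a href="https://example.com/source">(It never apologized.)</a></span>',
    "html.parser",
).find("span")
without_link = BeautifulSoup("<span>No source here.</span>", "html.parser").find("span")

print(parse_rebuttal(with_link))
print(parse_rebuttal(without_link))
```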
Now, iterating over all the records and putting them into a dataframe
rows = []
for r in results:
    date = r.find("strong").text[:-1] + ", 2017"
    lie = r.contents[1][1:-2]
    explanation = r.find("a").text[1:-1]
    link = r.find("a")["href"]
    rows.append( (date, lie, explanation, link) )
df = pd.DataFrame(data = rows, columns=["Date", "Lie", "Explanation", "Link"])
df["Date"] = pd.to_datetime(df["Date"])
df.head(20)
| | Date | Lie | Explanation | Link |
---|---|---|---|---|
0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
5 | 2017-01-25 | You had millions of people that now aren't ins... | The real number is less than 1 million, accord... | https://www.nytimes.com/2017/03/13/us/politics... |
6 | 2017-01-25 | So, look, when President Obama was there two w... | There were no gun homicide victims in Chicago ... | https://www.dnainfo.com/chicago/2017-chicago-m... |
7 | 2017-01-26 | We've taken in tens of thousands of people. We... | Vetting lasts up to two years. | https://www.nytimes.com/interactive/2017/01/29... |
8 | 2017-01-26 | I cut off hundreds of millions of dollars off ... | Most of the cuts were already planned. | https://www.washingtonpost.com/news/fact-check... |
9 | 2017-01-28 | The coverage about me in the @nytimes and the ... | It never apologized. | https://www.nytimes.com/2016/11/13/us/election... |
10 | 2017-01-29 | The Cuban-Americans, I got 84 percent of that ... | There is no support for this. | http://www.pewresearch.org/fact-tank/2016/11/1... |
11 | 2017-01-30 | Only 109 people out of 325,000 were detained a... | At least 746 people were detained and processe... | http://markets.on.nytimes.com/research/stocks/... |
12 | 2017-02-03 | Professional anarchists, thugs and paid protes... | There is no evidence of paid protesters. | https://www.nytimes.com/2017/01/28/nyregion/jf... |
13 | 2017-02-04 | After being forced to apologize for its bad an... | It never apologized. | https://www.nytimes.com/2016/11/13/us/election... |
14 | 2017-02-05 | We had 109 people out of hundreds of thousands... | About 60,000 people were affected. | http://www.politifact.com/truth-o-meter/statem... |
15 | 2017-02-06 | I have already saved more than $700 million wh... | Much of the price drop was projected before Tr... | https://www.washingtonpost.com/news/fact-check... |
16 | 2017-02-06 | It's gotten to a point where it is not even be... | Terrorism has been reported on, often in detail. | https://www.nytimes.com/2017/02/07/us/politics... |
17 | 2017-02-06 | The failing @nytimes was forced to apologize t... | It didn't apologize. | https://www.nytimes.com/2016/11/13/us/election... |
18 | 2017-02-06 | And the previous administration allowed it to ... | The group’s origins date to 2004. | https://www.nytimes.com/2015/11/19/world/middl... |
19 | 2017-02-07 | And yet the murder rate in our country is the ... | It was higher in the 1980s and '90s. | http://www.politifact.com/truth-o-meter/statem... |
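The loop above stops with an error if any record lacks an a tag, and the result is never written out, although the introduction mentions saving to a local file. The following is a minimal sketch under stated assumptions: the HTML is a made-up three-record sample (not the article's data), malformed records are counted and skipped, and the CSV is written to an in-memory buffer. The names `sample_df`, `skipped`, and `csv_text` are hypothetical.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample: two well-formed records and one without a source link.
sample_html = """
<span class="short-desc"><strong>Jan. 21 </strong>“First claim.” <span class="short-truth"><a href="https://example.com/a">(First rebuttal.)</a></span></span>
<span class="short-desc"><strong>Jan. 23 </strong>“Second claim.” <span class="short-truth"><a href="https://example.com/b">(Second rebuttal.)</a></span></span>
<span class="short-desc"><strong>Jan. 25 </strong>“Claim without a source.” </span>
"""

records = BeautifulSoup(sample_html, "html.parser").find_all("span", class_="short-desc")

rows, skipped = [], 0
for record in records:
    anchor = record.find("a")
    if anchor is None:
        skipped += 1  # count and skip the record instead of crashing
        continue
    rows.append({
        "Date": record.find("strong").get_text(strip=True) + ", 2017",
        "Lie": str(record.contents[1]).strip().strip("“”"),
        "Explanation": anchor.get_text(strip=True).strip("()"),
        "Link": anchor.get("href"),
    })

sample_df = pd.DataFrame(rows, columns=["Date", "Lie", "Explanation", "Link"])
sample_df["Date"] = pd.to_datetime(sample_df["Date"])

# Write to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(len(sample_df), skipped)
```

Applied to the real data, the same `to_csv` call with a file name of your choice in place of `buffer` would save the scraped table locally.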