
Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in tabular (spreadsheet-like) form.

# Loading libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

Here is the link to the NY Times article on Trump’s lies: LINK

Here is the structure of the first lie:

Jan. 21 “I wasn’t a fan of Iraq. I didn’t want to go into Iraq.” (He was for an invasion before he was against it.)  

link = "https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html"
response = requests.get(link)
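
The request above assumes the server answers successfully and promptly. A slightly more defensive sketch is shown here; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original walkthrough:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not from the original post)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # so an error page is never passed on to the parser by mistake
    response.raise_for_status()
    return response.text
```

Calling `fetch_html` with the article URL would return the same text as `response.text`, but it fails loudly when the page cannot be retrieved.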

Collecting all the records

soup = BeautifulSoup(response.text, "html.parser")
results = soup.find_all(name = "span", attrs = {"class" : "short-desc"})
len(results)
180
results[0]
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>
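
To see what `find_all` matches without touching the network, the same call can be tried on a small inline snippet. This is a minimal sketch: the HTML is modeled on the first result above, the `href` is a placeholder rather than the real link, and `select` with a CSS selector is shown as an alternative that should match the same element here.

```python
from bs4 import BeautifulSoup

# Inline sample modeled on the first record; the href is a placeholder, not the real link
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
    '<span class="short-truth"><a href="https://example.com/source" target="_blank">'
    "(He was for an invasion before he was against it.)</a></span></span>"
)

soup = BeautifulSoup(html, "html.parser")

# Match by tag name and class attribute, as in the text
by_find_all = soup.find_all(name="span", attrs={"class": "short-desc"})

# The same match expressed as a CSS selector
by_select = soup.select("span.short-desc")

print(len(by_find_all), len(by_select))
```

The nested `short-truth` span is not counted, because only spans carrying the `short-desc` class are requested.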

Parse the first lie into 4 structured columns:

  • Date
  • Lie
  • Explanation
  • Link

Look at the first result to see the pattern of the HTML

r = results[0]
r
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Parsing date by looking at the strong tag

date = r.find("strong").text[:-1] + ", 2017"
date
'Jan. 21, 2017'
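
The `[:-1]` slice works because the text of the strong tag ends with exactly one space. A sketch that does not depend on that detail is shown below; it uses only the standard library, and the format string `"%b. %d, %Y"` is an assumption that fits abbreviated months such as "Jan." and may not fit every date on the page.

```python
from datetime import datetime

raw = "Jan. 21 "  # text of the <strong> tag, including its trailing space

# strip() removes surrounding whitespace however much there is
date_text = raw.strip() + ", 2017"

# Explicit format: abbreviated month followed by a period, day, four-digit year
parsed = datetime.strptime(date_text, "%b. %d, %Y")

print(date_text)
print(parsed.date())
```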

Parsing the lie from the text node that follows the strong tag

lie = r.contents[1][1:-2]
lie
"I wasn't a fan of Iraq. I didn't want to go into Iraq."

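The `[1:-2]` slice assumes one opening quote at the start and a closing quote plus one space at the end. A small sketch of an alternative that strips by character rather than by position (the sample string is copied from the first record):

```python
raw = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

# Remove surrounding whitespace first, then any leading or trailing curly quotes
lie = raw.strip().strip("“”")

print(lie)
```

This keeps working if a record has extra whitespace around the quote, although it would also remove a curly quote that legitimately starts or ends the statement.
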
Parsing explanation by looking at the a tag

explanation = r.find("a").text[1:-1]
explanation
'He was for an invasion before he was against it.'

Parsing link out by looking at the href key of the a tag

link = r.find("a")["href"]
link
'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
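
The four steps above can be bundled into one function. This is only a sketch: `parse_record` is a hypothetical name, the sample HTML uses a placeholder `href`, and the fallback to `None` for a missing a tag is a precaution rather than something observed in the data.

```python
from bs4 import BeautifulSoup


def parse_record(tag, year="2017"):
    """Turn one short-desc span into a (date, lie, explanation, link) tuple."""
    date = tag.find("strong").get_text(strip=True) + ", " + year
    # The lie is the text node right after the <strong> tag
    lie = tag.contents[1].strip().strip("“”")
    anchor = tag.find("a")
    # Assumption: tolerate records without an <a> tag instead of raising
    explanation = anchor.get_text(strip=True).strip("()") if anchor else None
    link = anchor.get("href") if anchor else None
    return date, lie, explanation, link


# Sample modeled on the first record; the href is a placeholder
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
    '<span class="short-truth"><a href="https://example.com/source" target="_blank">'
    "(He was for an invasion before he was against it.)</a></span></span>"
)
record = BeautifulSoup(html, "html.parser").find("span", class_="short-desc")

print(parse_record(record))
```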

Now, iterate over all the records and put them into a dataframe

rows = []
for r in results:
    date = r.find("strong").text[:-1] + ", 2017"
    lie = r.contents[1][1:-2]
    explanation = r.find("a").text[1:-1]
    link = r.find("a")["href"]
    rows.append( (date, lie, explanation, link) )
df = pd.DataFrame(data = rows, columns=["Date", "Lie", "Explanation", "Link"])
df["Date"] = pd.to_datetime(df["Date"])
df.head(20)
    Date        Lie                                                Explanation                                        Link
 0  2017-01-21  I wasn't a fan of Iraq. I didn't want to go in...  He was for an invasion before he was against it.   https://www.buzzfeed.com/andrewkaczynski/in-20...
 1  2017-01-21  A reporter for Time magazine — and I have been...  Trump was on the cover 11 times and Nixon appe...  http://nation.time.com/2013/11/06/10-things-yo...
 2  2017-01-23  Between 3 million and 5 million illegal votes ...  There's no evidence of illegal voting.             https://www.nytimes.com/2017/01/23/us/politics...
 3  2017-01-25  Now, the audience was the biggest ever. But th...  Official aerial photos show Obama's 2009 inaug...  https://www.nytimes.com/2017/01/21/us/politics...
 4  2017-01-25  Take a look at the Pew reports (which show vot...  The report never mentioned voter fraud.            https://www.nytimes.com/2017/01/24/us/politics...
 5  2017-01-25  You had millions of people that now aren't ins...  The real number is less than 1 million, accord...  https://www.nytimes.com/2017/03/13/us/politics...
 6  2017-01-25  So, look, when President Obama was there two w...  There were no gun homicide victims in Chicago ...  https://www.dnainfo.com/chicago/2017-chicago-m...
 7  2017-01-26  We've taken in tens of thousands of people. We...  Vetting lasts up to two years.                     https://www.nytimes.com/interactive/2017/01/29...
 8  2017-01-26  I cut off hundreds of millions of dollars off ...  Most of the cuts were already planned.             https://www.washingtonpost.com/news/fact-check...
 9  2017-01-28  The coverage about me in the @nytimes and the ...  It never apologized.                               https://www.nytimes.com/2016/11/13/us/election...
10  2017-01-29  The Cuban-Americans, I got 84 percent of that ...  There is no support for this.                      http://www.pewresearch.org/fact-tank/2016/11/1...
11  2017-01-30  Only 109 people out of 325,000 were detained a...  At least 746 people were detained and processe...  http://markets.on.nytimes.com/research/stocks/...
12  2017-02-03  Professional anarchists, thugs and paid protes...  There is no evidence of paid protesters.           https://www.nytimes.com/2017/01/28/nyregion/jf...
13  2017-02-04  After being forced to apologize for its bad an...  It never apologized.                               https://www.nytimes.com/2016/11/13/us/election...
14  2017-02-05  We had 109 people out of hundreds of thousands...  About 60,000 people were affected.                 http://www.politifact.com/truth-o-meter/statem...
15  2017-02-06  I have already saved more than $700 million wh...  Much of the price drop was projected before Tr...  https://www.washingtonpost.com/news/fact-check...
16  2017-02-06  It's gotten to a point where it is not even be...  Terrorism has been reported on, often in detail.   https://www.nytimes.com/2017/02/07/us/politics...
17  2017-02-06  The failing @nytimes was forced to apologize t...  It didn't apologize.                               https://www.nytimes.com/2016/11/13/us/election...
18  2017-02-06  And the previous administration allowed it to ...  The group’s origins date to 2004.                  https://www.nytimes.com/2015/11/19/world/middl...
19  2017-02-07  And yet the murder rate in our country is the ...  It was higher in the 1980s and '90s.               http://www.politifact.com/truth-o-meter/statem...
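
Since the goal of scraping is usually to save the data locally, the dataframe can then be written out as CSV. A minimal sketch, assuming a one-row frame built from the first record and an in-memory buffer in place of a real file (the file name mentioned in the comment is hypothetical):

```python
import io

import pandas as pd

# One-row frame built from the first record shown above
df = pd.DataFrame(
    [(
        "Jan. 21, 2017",
        "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
        "He was for an invasion before he was against it.",
        "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the",
    )],
    columns=["Date", "Lie", "Explanation", "Link"],
)
df["Date"] = pd.to_datetime(df["Date"])

# Write to an in-memory buffer; passing a path such as "trump_lies.csv"
# (hypothetical name) would write a file on disk instead
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read it back, asking pandas to parse the Date column again
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Date"])

print(restored.shape)
```

Passing `index=False` leaves the row numbers out of the file, so the CSV contains only the four columns.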