Semalt – How To Scrape Web Pages?
Beautiful Soup is a Python library widely used to scrape web pages by building a parse tree from HTML and XML documents. Web scraping, the technique of extracting data from websites and pages, is widely used in data analysis and data management. In most of these fields, familiarity with the Python programming language is a practical prerequisite.
Python 3 has scraping tools and modules you can apply to your data management projects. The current release, Beautiful Soup 4, is compatible with both Python 3 and Python 2.7. Beautiful Soup 4 can also build a parse tree from malformed "tag soup" markup, such as HTML with unclosed tags. In this tutorial, you'll learn how to scrape a page and write the scraped data to a CSV file.
Getting started
To get started, set up a Python programming environment on a server or on your local machine. You should also install the Beautiful Soup and Requests modules, for example with `pip install beautifulsoup4 requests`. Basic knowledge of working with both modules is a prerequisite, and familiarity with HTML tags and structure is an added advantage.
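A quick way to confirm your setup is to import both libraries and print their versions; this is only a minimal check, assuming the packages were installed with the pip command above.

```python
# Minimal check that the required libraries are available in the active environment.
import bs4
import requests

print(bs4.__version__)       # Beautiful Soup 4, e.g. 4.x
print(requests.__version__)  # Requests, e.g. 2.x
```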
Understanding your data
In this tutorial, real data from the National Gallery of Art will be used to help you understand how to use Beautiful Soup 4. The National Gallery of Art holds around 120,000 pieces by approximately 13,000 artists and is based in Washington, D.C., United States.
Web data extraction with Beautiful Soup is not that complicated. For example, if you focus on the index of artists whose last names begin with the letter Z, note down the first name on the list; in this case it is Zabaglia, Niccola. For consistency, also note the number of pages in that index and the name of the last artist on the final page.
How to import the Requests and Beautiful Soup libraries
To import the libraries, first activate your Python 3 programming environment. Make sure you are in the directory where the environment lives, then run the following command to activate it: `. my_env/bin/activate`.
Create a new file and import the Beautiful Soup and Requests libraries. The Requests library lets you make HTTP requests from your Python programs in a readable way, while Beautiful Soup handles parsing and scraping the pages. Beautiful Soup is imported from the `bs4` package.
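As a minimal sketch, the imports look like this; the file name in the comment is only an illustrative choice.

```python
# nga_z_artists.py — illustrative file name for this tutorial's script
import requests
from bs4 import BeautifulSoup
```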
How to collect and parse a web page
Using Requests, collect the URL of your first page; the response will be assigned to the variable page. Then build a BeautifulSoup object from that response and parse it with Python's built-in html.parser.
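A minimal sketch of this step is shown below. The URL points to an archived copy of the letter-Z artist index; it is an illustrative assumption, so substitute the page you identified while inspecting the site.

```python
import requests
from bs4 import BeautifulSoup

# Example URL for the letter-Z artist index (an archived copy; adjust to the page you are scraping).
url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm'

# Collect the first page and assign the response to the variable `page`.
page = requests.get(url)

# Build a BeautifulSoup object and parse it with Python's built-in parser.
soup = BeautifulSoup(page.text, 'html.parser')
```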
In this tutorial, the aim is to collect the artists' names and the links associated with them; you could collect artists' dates and nationalities in the same way. On Windows, right-click the first artist's name (in this case Zabaglia, Niccola); on macOS, hold "CTRL" and click the name. Then choose "Inspect Element" from the menu that pops up to open the browser's developer tools and see which tags and classes wrap the names. Once you know those, you can have Beautiful Soup pull the artists' names out of the parse tree and print them.
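A hedged sketch of this step follows. It assumes, based on inspecting the page, that the artist names sit inside `<a>` tags within an element of class `BodyText`; that class name is an assumption drawn from the page's markup, not something provided by the library.

```python
# Assumption from inspecting the page: the artist list sits inside <div class="BodyText">.
artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Print each link tag that wraps an artist's name.
for artist_name in artist_name_list_items:
    print(artist_name)
```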
Removing the bottom links
To remove the navigation links at the bottom of the page, inspect the DOM again by right-clicking one of those links. You'll see that they sit inside an HTML table of their own. With Beautiful Soup, call the decompose() method to remove those tags from the parse tree before you iterate over the artist links.
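A sketch of that step, again assuming (from inspecting the page) that the bottom letter-navigation links carry the class `AlphaNav`:

```python
# Assumption: the bottom letter-navigation links live in an element with class "AlphaNav".
last_links = soup.find(class_='AlphaNav')
last_links.decompose()  # remove those tags from the parse tree

# Re-collect the artist links now that the navigation links are gone.
artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')
```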
How to pull content from a tag
You don't have to print the entire link tag: use Beautiful Soup to pull just the text out of each tag. You can also capture the URLs associated with the artists by reading each tag's href attribute.
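As a sketch building on the loop above, `.contents[0]` grabs the text inside each tag and `.get('href')` grabs the relative link; the host prefix used to rebuild an absolute URL is an assumption tied to the archived page used earlier.

```python
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]  # the text inside the <a> tag
    # Assumption: hrefs on the archived page are relative to this host.
    links = 'https://web.archive.org' + artist_name.get('href')
    print(names)
    print(links)
```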
Capturing scraped data to a CSV file
A CSV file lets you store structured data as plain text, a format commonly used for spreadsheets and datasheets. Knowledge of handling plain text files in Python is recommended.
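Putting the pieces together, a minimal sketch using Python's built-in csv module might look like the following; the output file name z-artist-names.csv and the column headers are illustrative choices.

```python
import csv

# Illustrative output file name and column headers.
with open('z-artist-names.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Name', 'Link'])

    # Write one row per artist: the name text and the rebuilt absolute link.
    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')
        writer.writerow([names, links])
```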
Web data extraction is used to scrape pages and obtain information. Be considerate of the websites you are extracting information from, as some sites restrict or prohibit web data extraction. Scraping pages with Beautiful Soup and Python 3 is that simple.