INTRODUCTION
Before the digital age, the word currency meant things like gold, silver, and raw materials. In today's digital world, data has become the new currency, and extracting it from the web yields information that is valuable to businesses and researchers. Machine learning plays an important role in technologies such as driverless cars and image and speech recognition, and data scientists need large amounts of data to build robust and reliable ML models. Web scraping, the process of automatically retrieving and extracting data from websites, has emerged as a powerful technique to gather insights, monitor trends, and automate repetitive tasks. In this blog, we will explore web scraping with Python and Selenium. So let's get started.
TABLE OF CONTENTS:-
What is Selenium
Basics of Selenium with Python
Setup & Tools | Chrome WebDriver | Working with Selenium
Locating elements From HTML
Page Navigating and Clicking Elements
Putting it all together
WHAT IS SELENIUM?
Selenium is a popular open-source framework used for automating web browsers. It provides a set of tools and libraries that allow developers to interact with web applications, simulate user actions, and perform automated testing. With Selenium, users can write scripts in various programming languages, such as Python, Java, or C#, to automate tasks like clicking buttons, filling out forms, and navigating through web pages.
It supports multiple browsers like Chrome, Firefox, and Safari, enabling cross-browser testing. Selenium is widely used in web scraping, web testing, and web application development, offering a powerful solution for automating web-based tasks efficiently and reliably.
BASICS OF SELENIUM WITH PYTHON
So now you have a basic idea of what web scraping and Selenium are, and of how important data is for ML models. Let's start learning the basics of Selenium with Python!
Setup & tools:-
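Installation:- Install Selenium using pip. Open your command prompt and run pip install selenium
Download Chrome Driver:- To get the Chrome web driver, you can choose either of the methods below. Whichever you pick, the driver version must match your installed Chrome version.
You can download ChromeDriver directly from this link- https://chromedriver.chromium.org/downloads
Or, you can let the third-party webdriver-manager package download it for you (install it with pip install webdriver-manager). After importing ChromeDriverManager from webdriver_manager.chrome and Service from selenium.webdriver.chrome.service, pass the downloaded path through a Service object, which is how Selenium 4 expects it- driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))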
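Before moving on, it can help to confirm that both pieces are actually in place. The snippet below is a minimal sketch that uses only the Python standard library. The name check_setup is a hypothetical helper, and the driver check assumes chromedriver sits somewhere on your PATH, which is not required if you pass its location explicitly.

```python
import importlib.util
import shutil


def check_setup():
    """Report whether Selenium is importable and chromedriver is on PATH."""
    return {
        # find_spec returns None when the package is not installed
        "selenium_installed": importlib.util.find_spec("selenium") is not None,
        # shutil.which returns None when no matching executable is on PATH
        "chromedriver_on_path": shutil.which("chromedriver") is not None,
    }


print(check_setup())
```

A False for chromedriver_on_path is not necessarily a problem. It only means the driver has to be located some other way, for example by the explicit path used in the next section.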
The complete Selenium documentation is available online. It is largely self-explanatory, so make sure to read it to get the most out of Selenium with Python.
Now that all the tools are ready, let's code!
Chrome WebDriver and working with Selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string so the backslashes in the Windows path are not treated as escapes.
# Adjust this to wherever you saved chromedriver.exe.
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(PATH))  # use Chrome as the browser for scraping (Selenium 4 syntax)
driver.get("https://www.consultanubhav.com")  # open the website in the browser
print(driver.title)  # title is an attribute, not a method
driver.close()  # close the current tab
driver.quit()  # shut down the whole browser and end the session
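One detail in the script above deserves a closer look: the Windows path. Inside an ordinary Python string, a backslash followed by certain letters is an escape sequence, so a path can silently change meaning. The sketch below uses a hypothetical folder name, picked only because it contains two real escape sequences, to show why a raw string (the r prefix) is the safer way to write such paths.

```python
from pathlib import PureWindowsPath

# Hypothetical folder, chosen because "\t" and "\n" are real escape sequences.
fragile = "C:\tools\new"   # "\t" becomes a tab, "\n" becomes a newline
safe = r"C:\tools\new"     # raw string: backslashes stay exactly as typed

print(repr(fragile))
print(repr(safe))
print(len(fragile), len(safe))

# PureWindowsPath joins path parts with backslashes on any operating system.
driver_path = str(PureWindowsPath(safe) / "chromedriver.exe")
print(driver_path)
```

The two strings differ in length because each escape sequence collapses two typed characters into one.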
We can use several basic commands:
driver.title - an attribute that holds the title of the web page; print it to display it
driver.close() - closes the current tab
driver.quit() - closes the whole browser and ends the session
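Because quit() is what actually ends the browser session, a common pattern is to wrap the work in try/finally so the browser is shut down even when something fails. Here is a minimal sketch of that pattern. The name fetch_title is a hypothetical helper, and StubDriver is a stand-in object used only so the example runs without a browser. With Selenium you would pass a real Chrome driver instance instead.

```python
class StubDriver:
    """Stand-in with the three driver members used below (not part of Selenium)."""

    def __init__(self, title):
        self.title = title       # like the real driver, title is an attribute
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


def fetch_title(driver, url):
    """Open url and return the page title, always ending the session."""
    try:
        driver.get(url)
        return driver.title      # no parentheses: title is not a method
    finally:
        driver.quit()            # runs whether or not get() raised


stub = StubDriver(title="Example page")
print(fetch_title(stub, "https://www.consultanubhav.com"))
print(stub.closed)
```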
And that's it: you have completed the very basics of Selenium with Python. Now let's move on to the next phase.
Locating elements from HTML
First of all, let's import all the required libraries.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(PATH))
driver.get("https://techwithtim.net")
print(driver.title)

search = driver.find_element(By.NAME, "s")  # the search box
search.send_keys("test")
search.send_keys(Keys.RETURN)

try:
    # wait up to 10 seconds for the results container to appear
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "main"))
    )
    articles = main.find_elements(By.TAG_NAME, "article")
    for article in articles:
        # find_element (singular) returns one element, which has a .text attribute
        header = article.find_element(By.CLASS_NAME, "entry-summary")
        print(header.text)
finally:
    driver.quit()
Now let's break down the code above and understand what it means.
search = driver.find_element(By.NAME, "s")
search.send_keys("test")
search.send_keys(Keys.RETURN)
Here the driver finds the element whose name attribute is "s". When we inspect the page (that is, look at its HTML code), we can see the attributes that let us access elements. Most commonly, we access elements by ID, name, or class.
search = driver.find_element(By.NAME, "s")
This simply returns an object that represents the search bar, which we can now interact with.
search.send_keys("test")
search.send_keys(Keys.RETURN)
This sends the text "test" to the search bar. Keys.RETURN then presses Enter, so the text "test" is searched.
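The same steps can be collected into a small reusable function. This is a sketch under a few assumptions: search_site and main are hypothetical names, the locator and the submit key are passed in so the function is not tied to one particular page, and webdriver.Chrome() without arguments assumes Selenium can locate a driver on its own (Selenium 4.6 and later ship a driver manager for this; otherwise pass the driver location explicitly).

```python
def search_site(driver, locator, query, submit_key):
    """Type query into the element found by locator, then submit it."""
    box = driver.find_element(*locator)   # locator is a (strategy, value) pair
    box.clear()                           # remove any text already in the box
    box.send_keys(query)
    box.send_keys(submit_key)
    return box


def main():
    # Imported here so search_site itself has no hard dependency on Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome()  # assumption: a matching driver can be found
    try:
        driver.get("https://techwithtim.net")
        search_site(driver, (By.NAME, "s"), "test", Keys.RETURN)
        print(driver.title)
    finally:
        driver.quit()
```

Calling main() runs the search in a real browser. Keeping search_site separate makes it easy to reuse with a different locator, such as (By.ID, ...) or (By.CLASS_NAME, ...).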
try:
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "main"))
    )
    articles = main.find_elements(By.TAG_NAME, "article")
    for article in articles:
        header = article.find_element(By.CLASS_NAME, "entry-summary")
        print(header.text)
finally:
    driver.quit()
This piece of code adds an explicit wait. If the page has not finished loading when the script looks for an element, the lookup fails. WebDriverWait therefore keeps checking for up to 10 seconds until the element with ID "main" is present, and raises a TimeoutException if it never appears. The finally block makes sure the browser is closed either way.
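Conceptually, an explicit wait is a polling loop: check a condition, and if it is not yet met, sleep briefly and check again until a deadline passes. The pure-Python sketch below illustrates that idea only. It is not Selenium's implementation, wait_until is a hypothetical name, and the 0.5 second default simply mirrors WebDriverWait's default polling interval.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Demo: a condition that only succeeds on the third check,
# standing in for an element that appears once the page has loaded.
attempts = {"count": 0}


def page_ready():
    attempts["count"] += 1
    return "main" if attempts["count"] >= 3 else None


print(wait_until(page_ready, timeout=2.0, poll=0.1))
print(attempts["count"])
```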
Congratulations! You are getting the hang of it. Now let's move forward and do some more coding with Selenium using Python.
Page Navigating and Clicking Elements
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(PATH))
driver.get("https://techwithtim.net")

link = driver.find_element(By.LINK_TEXT, "Python Programming")
link.click()

try:
    # wait for the next link to exist on the new page before clicking it
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Beginner Python Tutorials"))
    )
    element.click()

    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "sow-button-19310003"))
    )
    element.click()

    # step back through the history, then forward again
    driver.back()
    driver.back()
    driver.back()
    driver.forward()
    driver.forward()
except Exception:
    # if any step fails, close the browser
    driver.quit()
link = driver.find_element(By.LINK_TEXT, "Python Programming")
link.click()
This lets us locate a link by the text it displays and then click the resulting element. Just make sure the link is actually clickable.
Also, when a click takes us to a new page, we must wait for the next element to exist before we try to click it, which is what the WebDriverWait calls do.
driver.back()
driver.forward()
With these commands we can move back and forth through the browser history, respectively.
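To see where the sequence of back() and forward() calls in the script ends up, browser history can be modeled as a list of pages plus a current position. This toy model is an illustration only, not Selenium code. History is a hypothetical class, the page labels are shorthand for the four pages the script visits, and it assumes each click adds one entry to the history of the same tab.

```python
class History:
    """Toy model of browser history: a list of pages and a current index."""

    def __init__(self, first_page):
        self.pages = [first_page]
        self.index = 0

    def visit(self, page):
        # Following a link discards any "forward" entries, then appends.
        self.pages = self.pages[: self.index + 1] + [page]
        self.index += 1

    def back(self):
        self.index = max(self.index - 1, 0)

    def forward(self):
        self.index = min(self.index + 1, len(self.pages) - 1)

    @property
    def current(self):
        return self.pages[self.index]


history = History("home")
for page in ["Python Programming", "Beginner Python Tutorials", "button target"]:
    history.visit(page)

for _ in range(3):
    history.back()
print(history.current)   # home

for _ in range(2):
    history.forward()
print(history.current)   # Beginner Python Tutorials
```

Under these assumptions, three steps back return to the home page, and two steps forward land on the Beginner Python Tutorials page rather than on the last page visited.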
PUTTING IT ALL TOGETHER
In conclusion, this blog aimed to introduce you to the world of Selenium automation using Python and empower you with the necessary tools and knowledge to get started. We explored the fundamentals of Selenium, its benefits, and how it can revolutionize your web testing and automation efforts.
Throughout this journey, we discussed the key components of Selenium, such as WebDriver, and learned how to set up a Selenium environment in Python. We covered basic operations like locating elements, interacting with web elements, handling different types of input fields, and navigating through web pages.