{"id":25528708,"date":"2022-06-17T16:41:22","date_gmt":"2022-06-17T11:11:22","guid":{"rendered":"https:\/\/entri.app\/blog\/?p=25528708"},"modified":"2022-06-17T16:41:22","modified_gmt":"2022-06-17T11:11:22","slug":"web-scraping-using-python-data-science-skills","status":"publish","type":"post","link":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/","title":{"rendered":"Web Scraping Using Python: Data Science Skills"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_79_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-69e163cfb3e9a\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-69e163cfb3e9a\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" 
href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Getting_Started\" >Getting Started<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Inspect_the_webpage\" >Inspect the webpage<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Parse_the_webpage_html_using_Beautiful_Soup\" >Parse the webpage html using Beautiful Soup<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Search_for_html_elements\" >Search for html elements<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Looping_through_elements_and_saving_variables\" >Looping through elements and saving variables<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Data_Cleaning\" >Data Cleaning<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Writing_to_an_output_file\" >Writing to an output file<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#Summary\" >Summary<\/a><\/li><\/ul><\/nav><\/div>\n<div class=\"\">\n<h1 id=\"f3e6\" class=\"pw-post-title jj jk jl bn jm jn jo jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh gc\" 
data-selectable-paragraph=\"\"><\/h1>\n<\/div>\n<p id=\"49fc\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">Web scraping is the practice of gathering data from websites using code, and the web is one of the most readily accessible sources of data.<\/p>\n<p id=\"ef71\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">In this blog, we will go through a simple example of how to scrape a website to gather data on the top 100 companies. Automating this process with a web scraper avoids manual data gathering, saves time, and also allows you to have all the data on the companies in one structured file.<\/p>\n<p data-selectable-paragraph=\"\"><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-25520910 size-full\" src=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square.png\" alt=\"Python and Machine Learning Square\" width=\"345\" height=\"345\" srcset=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square.png 345w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square-300x300.png 300w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square-150x150.png 150w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square-24x24.png 24w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square-48x48.png 48w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square-96x96.png 96w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Square-75x75.png 75w\" sizes=\"auto, (max-width: 345px) 100vw, 345px\" \/><\/a><\/p>\n<h2
id=\"4a14\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Getting_Started\"><\/span><strong>Getting Started<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"4429\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">For web scraping there are a few different libraries to consider, including:<\/p>\n<ul class=\"\">\n<li id=\"505d\" class=\"my mz jl kk b kl km kp kq kt na kx nb lb nc lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Beautiful Soup<\/li>\n<li data-selectable-paragraph=\"\">Requests<\/li>\n<li id=\"5592\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Scrapy<\/li>\n<li id=\"b00f\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Selenium<\/li>\n<\/ul>\n<p id=\"e97b\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">In this example we will be using Beautiful Soup. Using\u00a0<code class=\"fr nm nn no np b\">pip<\/code>, the Python package manager, you can install Beautiful Soup with the following:<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"b531\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">pip install BeautifulSoup4<\/span><\/pre>\n<p id=\"2681\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">With these libraries installed, let\u2019s get started!<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Grab the opportunity to learn Python with Entri! 
Click Here<\/a><\/strong><\/h4>\n<h2 id=\"06ca\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Inspect_the_webpage\"><\/span><strong>Inspect the webpage<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"3a7d\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">To know which elements to target in your Python code, you first need to inspect the web page.<\/p>\n<p id=\"1ffa\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">To gather data on the companies, you can inspect the page by right-clicking on the element of interest and selecting Inspect. This brings up the HTML code, where we can see the element that each field is contained within.<\/p>\n<div class=\"nq nr ns nt gz o he\">\n<figure class=\"nz jc oa ob oc od oe paragraph-image\">\n<div class=\"jd je dq jf cf jg\" role=\"button\"><img loading=\"lazy\" decoding=\"async\" class=\"cf jh ji\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/2880\/1*bnAgwIK0qH4tE49im7mYHA.png\" alt=\"\" width=\"1440\" height=\"900\" \/><\/div>\n<\/figure>\n<figure class=\"nz jc oa ob oc od oe paragraph-image\">\n<div class=\"jd je dq jf cf jg\" role=\"button\"><img loading=\"lazy\" decoding=\"async\" class=\"cf jh ji\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/2880\/1*JAWO1QTCEzhgPI6njZFKLg.png\" alt=\"\" width=\"1440\" height=\"900\" \/><\/div><figcaption class=\"of bm gp gn go og oh bn b bo bp co oi dq oj ok\" data-selectable-paragraph=\"\">Right-click on the element you are interested in and select \u2018Inspect\u2019; this brings up the html elements<\/figcaption><\/figure>\n<\/div>\n<p id=\"f475\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf
it gc\" data-selectable-paragraph=\"\">Since the data is stored in a table, it will be straightforward to scrape with just a few lines of code. This is a good example and a good place to start if you want to familiarise yourself with scraping websites, but keep in mind that it will not always be so simple!<\/p>\n<p id=\"0181\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">All 100 results are contained within rows in\u00a0<code class=\"fr nm nn no np b\">&lt;tr&gt;<\/code> elements and these are all visible on one page. This will not always be the case: when results span many pages, you may need to either change the number of results displayed per page, or loop over all pages to gather all the information.<\/p>\n<p id=\"4514\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">On the League Table webpage, a table containing 100 results is displayed. When inspecting the page, it is easy to see a pattern in the html.
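As an aside, when results do span several pages, one common pattern is to generate the page URLs up front and loop over them. The sketch below assumes a hypothetical `?page=` query parameter, purely for illustration; check the real site's pagination scheme before relying on it (the League Table page shows all 100 results at once, so it does not need this).

```python
# Sketch only: the "?page=" parameter is an assumption for illustration,
# not part of the site described in this article.

def build_page_urls(base_url, num_pages):
    """Return one URL per results page, assuming a ?page=N query parameter."""
    return [f"{base_url}?page={n}" for n in range(1, num_pages + 1)]

urls = build_page_urls(
    'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/', 3)
for url in urls:
    # each url would then be fetched with urllib.request.urlopen(url)
    # and parsed with BeautifulSoup, exactly as for a single page
    print(url)
```

Each page is then scraped with the same find/loop logic used for a single page, appending to one shared results list.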
The results are contained in rows within the table:<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"10fb\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">&lt;table class=\"tableSorter\"&gt;<\/span><\/pre>\n<p id=\"23b3\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">Because the rows\u00a0<code class=\"fr nm nn no np b\">&lt;tr&gt;<\/code> repeat, we can keep our code minimal by using a loop in Python to find the data and write it to a file!<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Entri gives you the best Coding experience<\/a><\/strong><\/h4>\n<h2 id=\"422c\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Parse_the_webpage_html_using_Beautiful_Soup\"><\/span><strong>Parse the webpage html using Beautiful Soup<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"833c\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">Now that you have looked at the structure of the html and familiarised yourself with what you are scraping, it\u2019s time to get started with Python!<\/p>\n<p id=\"74de\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">The first step is to import the libraries that you will be using for your web scraper. We have already talked about BeautifulSoup above, which helps us to handle the html. The next library we are importing is\u00a0<code class=\"fr nm nn no np b\">urllib<\/code>, which makes the connection to the webpage.
Finally, we will be writing the output to a csv so we also need to import the\u00a0<code class=\"fr nm nn no np b\">csv<\/code>\u00a0library. As an alternative, the\u00a0<code class=\"fr nm nn no np b\">json<\/code>\u00a0library could be used here instead.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"bb4b\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># import libraries\r\nfrom bs4 import BeautifulSoup\r\nimport urllib.request\r\nimport csv<\/span><\/pre>\n<p id=\"20a0\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">The next step is to define the url that you are scraping. As discussed in the previous section, this webpage presents all results on one page so the full url as in the address bar is given here.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"5ee0\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># specify the url\r\nurlpage =  'http:\/\/www.fasttrack.co.uk\/league-tables\/tech-track-100\/league-table\/'<\/span><\/pre>\n<p id=\"0394\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">We then make the connection to the webpage and we can parse the html using BeautifulSoup, storing the object in the variable \u2018soup\u2019.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"a3e8\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># query the website and return the html to the variable 'page'\r\npage = urllib.request.urlopen(urlpage)\r\n# parse the html using beautiful soup and store in variable 'soup'\r\nsoup = BeautifulSoup(page, 'html.parser')<\/span><\/pre>\n<p id=\"9208\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">We can print the soup variable at this stage 
which should return the full parsed html of the webpage we have requested.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"c527\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">print(soup)<\/span><\/pre>\n<p id=\"e72a\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">If there is an error or the variable is empty, then the request may not have been successful. You may wish to implement error handling at this point using the\u00a0<code class=\"fr nm nn no np b\">urllib.error<\/code>\u00a0module.<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Grab the opportunity to learn Python with Entri! Click Here<\/a><\/strong><\/h4>\n<h2 id=\"4ab3\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Search_for_html_elements\"><\/span><strong>Search for html elements<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"5150\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">As all of the results are contained within a table, we can search the soup object for the table using the\u00a0<code class=\"fr nm nn no np b\">find<\/code>\u00a0method. 
We can then find each row within the table using the\u00a0<code class=\"fr nm nn no np b\">find_all<\/code>\u00a0method.<\/p>\n<p id=\"0385\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">If we print the number of rows we should get a result of 101, the 100 rows plus the header.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"fb05\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># find results within table\r\ntable = soup.find('table', attrs={'class': 'tableSorter'})\r\nresults = table.find_all('tr')\r\nprint('Number of results', len(results))<\/span><\/pre>\n<p id=\"769a\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">We can therefore loop over the results to gather the data.<\/p>\n<p id=\"5e68\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">Printing the first 2 rows in the soup object, we can see that the structure of each row is:<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"b20b\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">&lt;tr&gt;\r\n&lt;th&gt;Rank&lt;\/th&gt;\r\n&lt;th&gt;Company&lt;\/th&gt;\r\n&lt;th class=\"\"&gt;Location&lt;\/th&gt;\r\n&lt;th class=\"no-word-wrap\"&gt;Year end&lt;\/th&gt;\r\n&lt;th class=\"\" style=\"text-align:right;\"&gt;Annual sales rise over 3 years&lt;\/th&gt;\r\n&lt;th class=\"\" style=\"text-align:right;\"&gt;Latest sales \u00a3000s&lt;\/th&gt;\r\n&lt;th class=\"\" style=\"text-align:right;\"&gt;Staff&lt;\/th&gt;\r\n&lt;th class=\"\"&gt;Comment&lt;\/th&gt;\r\n&lt;!--                            &lt;th&gt;FYE&lt;\/th&gt;--&gt;\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;1&lt;\/td&gt;\r\n&lt;td&gt;&lt;a 
href=\"http:\/\/www.fasttrack.co.uk\/company_profile\/wonderbly-3\/\"&gt;&lt;span class=\"company-name\"&gt;Wonderbly&lt;\/span&gt;&lt;\/a&gt;Personalised children's books&lt;\/td&gt;\r\n&lt;td&gt;East London&lt;\/td&gt;\r\n&lt;td&gt;Apr-17&lt;\/td&gt;\r\n&lt;td style=\"text-align:right;\"&gt;294.27%&lt;\/td&gt;\r\n&lt;td style=\"text-align:right;\"&gt;*25,860&lt;\/td&gt;\r\n&lt;td style=\"text-align:right;\"&gt;80&lt;\/td&gt;\r\n&lt;td&gt;Has sold nearly 3m customisable children\u2019s books in 200 countries&lt;\/td&gt;\r\n&lt;!--                                            &lt;td&gt;Apr-17&lt;\/td&gt;--&gt;\r\n&lt;\/tr&gt;<\/span><\/pre>\n<p id=\"115f\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">There are 8 columns in the table containing: Rank, Company, Location, Year End, Annual Sales Rise, Latest Sales, Staff and Comments, all of which are interesting data that we can save.<\/p>\n<p id=\"18a5\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">This structure is consistent throughout all rows on the webpage, and therefore we can again use the <code class=\"fr nm nn no np b\">find_all<\/code>\u00a0method to assign each column to a variable that we can write to a csv or JSON by searching for the\u00a0<code class=\"fr nm nn no np b\">&lt;td&gt;<\/code>\u00a0element.<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Entri gives you the best Coding experience<\/a><\/strong><\/h4>\n<h2 id=\"28e8\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Looping_through_elements_and_saving_variables\"><\/span><strong>Looping through elements and saving variables<\/strong><span 
class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"a467\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">In python, it is useful to append the results to a list to then write the data to a file. We should declare the list and set the headers of the csv before the loop with the following:<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"3139\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># create and write headers to a list \r\nrows = []\r\nrows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales \u00a3000s', 'Staff', 'Comments'])\r\nprint(rows)<\/span><\/pre>\n<p id=\"f51b\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">This will print out the first row that we have added to the list containing the headers.<\/p>\n<p id=\"8723\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">You might notice that there are a few extra fields\u00a0<code class=\"fr nm nn no np b\">Webpage<\/code>\u00a0and\u00a0<code class=\"fr nm nn no np b\">Description<\/code>\u00a0which are not column names in the table, but if you take a closer look in the html from when we printed the soup variable above, the second row contains more than just the company name. 
We can use some further extraction to get this extra information.<\/p>\n<p id=\"3b27\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">The next step is to loop over the results, process the data and append to\u00a0<code class=\"fr nm nn no np b\">rows\u00a0<\/code>which can be written to a csv.<\/p>\n<p id=\"ff5f\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">To find the results in the loop:<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"11b1\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># loop over results\r\nfor result in results:\r\n    # find all columns per result\r\n    data = result.find_all('td')\r\n    # check that columns have data \r\n    if len(data) == 0: \r\n        continue<\/span><\/pre>\n<p id=\"0a26\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">Since the first row in the table contains only the headers, we can skip this result, as shown above. It also does not contain any\u00a0<code class=\"fr nm nn no np b\">&lt;td&gt;<\/code>\u00a0elements so when searching for the element, nothing is returned. 
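This behaviour is easy to confirm on a toy table; in the self-contained sketch below the rows are made up, but the header/data distinction works exactly as on the real page:

```python
from bs4 import BeautifulSoup

# A made-up miniature of the league table, for illustration only
html = """
<table class="tableSorter">
<tr><th>Rank</th><th>Company</th></tr>
<tr><td>1</td><td>Wonderbly</td></tr>
<tr><td>2</td><td>ExampleCo</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table', attrs={'class': 'tableSorter'}).find_all('tr')

print(len(rows))                    # 3: one header row plus two results
print(len(rows[0].find_all('td')))  # 0: the header row has only <th> cells
print(len(rows[1].find_all('td')))  # 2: data rows contain <td> cells
```
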
We can then check that only results containing data are processed by requiring the length of the data to be non-zero.<\/p>\n<p id=\"f470\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">We can then start to process the data and save to variables.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"7991\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">    # write columns to variables\r\n    rank = data[0].getText()\r\n    company = data[1].getText()\r\n    location = data[2].getText()\r\n    yearend = data[3].getText()\r\n    salesrise = data[4].getText()\r\n    sales = data[5].getText()\r\n    staff = data[6].getText()\r\n    comments = data[7].getText()<\/span><\/pre>\n<p id=\"6db6\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">The above simply gets the text from each of the columns and saves to variables. Some of this data however needs further cleaning to remove unwanted characters or extract further information.<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Grab the opportunity to learn Python with Entri! Click Here<\/a><\/strong><\/h4>\n<h2 id=\"7318\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Data_Cleaning\"><\/span><strong>Data Cleaning<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"dccd\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">If we print out the variable\u00a0<code class=\"fr nm nn no np b\">company<\/code>, the text not only contains the name of the company but also a description. 
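Because `getText()` returns plain strings, the cleanup that follows can be tried in isolation. The sample values below are the ones shown for the first row of the table earlier; the footnote dagger `\u2020` is assumed to be one of the symbols that may appear in the sales figures:

```python
# Sample values as getText() returns them for the first table row
company = "WonderblyPersonalised children's books"
companyname = "Wonderbly"   # from the <span class="company-name"> element
sales = "*25,860"

# remove the company name, leaving only the description
description = company.replace(companyname, "")

# strip footnote symbols and drop the thousands separator
sales_clean = sales.strip("*").strip("\u2020").replace(",", "")

print(description)  # Personalised children's books
print(sales_clean)  # 25860
```
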
If we then print out\u00a0<code class=\"fr nm nn no np b\">sales<\/code>, it contains unwanted characters such as footnote symbols that would be useful to remove.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"4c19\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">    print('Company is', company)\r\n    # Company is WonderblyPersonalised children's books          \r\n    print('Sales', sales)\r\n    # Sales *25,860<\/span><\/pre>\n<p id=\"5083\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">We would like to split\u00a0<code class=\"fr nm nn no np b\">company<\/code>\u00a0into the company name and the description which we can do in a few lines of code. Looking again at the html, for this column there is a\u00a0<code class=\"fr nm nn no np b\">&lt;span&gt;<\/code>\u00a0element that contains only the company name. There is also a link in this column to another page on the website that has more detailed information about the company. 
We will be using this a little later!<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Entri gives you the best Coding experience<\/a><\/strong><\/h4>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"8fc7\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">&lt;td&gt;&lt;a href=\"http:\/\/www.fasttrack.co.uk\/company_profile\/wonderbly-3\/\"&gt;&lt;span class=\"company-name\"&gt;Wonderbly&lt;\/span&gt;&lt;\/a&gt;Personalised children's books&lt;\/td&gt;<\/span><\/pre>\n<p id=\"729c\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">To separate\u00a0<code class=\"fr nm nn no np b\">company<\/code>\u00a0into two fields, we can use the\u00a0<code class=\"fr nm nn no np b\">find<\/code>\u00a0method to save the\u00a0<code class=\"fr nm nn no np b\">&lt;span&gt;<\/code>\u00a0element and then use either\u00a0<code class=\"fr nm nn no np b\">strip<\/code>\u00a0or\u00a0<code class=\"fr nm nn no np b\">replace<\/code>\u00a0to remove the company name from the\u00a0<code class=\"fr nm nn no np b\">company<\/code>\u00a0variable, so that it leaves only the description.<br \/>\nTo remove the unwanted characters from\u00a0<code class=\"fr nm nn no np b\">sales<\/code>, we can again use the\u00a0<code class=\"fr nm nn no np b\">strip<\/code>\u00a0and\u00a0<code class=\"fr nm nn no np b\">replace<\/code>\u00a0methods!<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"0f3f\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">    # extract description from the name\r\n    companyname = data[1].find('span', attrs={'class':'company-name'}).getText()    \r\n    description = company.replace(companyname, '')\r\n    \r\n    # remove unwanted characters\r\n    sales = sales.strip('*').strip('\u2020').replace(',','')<\/span><\/pre>\n<p id=\"9314\"
class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">The last variable we would like to save is the company website. As discussed above, the second column contains a link to another page that has an overview of each company. Each company page has its own table, which most of the time contains the company website.<\/p>\n<figure class=\"nq nr ns nt gz jc gn go paragraph-image\">\n<div class=\"jd je dq jf cf jg\" role=\"button\">\n<div class=\"gn go om\"><img loading=\"lazy\" decoding=\"async\" class=\"cf jh ji\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/1400\/1*BDxmEE0Vka_Dqq_78i_X7w.png\" alt=\"\" width=\"700\" height=\"438\" \/><\/div>\n<\/div><figcaption class=\"of bm gp gn go og oh bn b bo bp co\" data-selectable-paragraph=\"\">Inspecting the element of the url on the company page<\/figcaption><\/figure>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Grab the opportunity to learn Python with Entri!
Click Here<\/a><\/strong><\/h4>\n<p id=\"c379\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">To scrape the url from each table and save it as a variable, we need to use the same steps as above:<\/p>\n<ul class=\"\">\n<li id=\"f186\" class=\"my mz jl kk b kl km kp kq kt na kx nb lb nc lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Find the element that has the url of the company page on the Fast Track website<\/li>\n<li id=\"9036\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Make a request to each company page url<\/li>\n<li id=\"5f7a\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Parse the html using BeautifulSoup<\/li>\n<li id=\"3bba\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Find the elements of interest<\/li>\n<\/ul>\n<p id=\"f4ed\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">Looking at a few of the company pages, as in the screenshot above, the urls are in the last row of the table, so we can search within the last row for the\u00a0<code class=\"fr nm nn no np b\">&lt;a&gt;<\/code>\u00a0element.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"4449\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">    # go to link and extract company website\r\n    url = data[1].find('a').get('href')\r\n    page = urllib.request.urlopen(url)\r\n    # parse the html \r\n    soup = BeautifulSoup(page, 'html.parser')\r\n    # find the last result in the table and get the link\r\n    try:\r\n        tableRow = soup.find('table').find_all('tr')[-1]\r\n        webpage = tableRow.find('a').get('href')\r\n    except AttributeError:\r\n        webpage = None<\/span><\/pre>\n<p id=\"7d00\"
class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">There also may be cases where the company website is not displayed so we can use a\u00a0<code class=\"fr nm nn no np b\">try<\/code>\u00a0<code class=\"fr nm nn no np b\">except<\/code>\u00a0condition, in case a url is not found.<\/p>\n<p id=\"933c\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">Once we have saved all of the data to variables, still within the loop, we can add each result to the list\u00a0<code class=\"fr nm nn no np b\">rows<\/code>.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"aba3\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\">    # write each result to rows\r\n    rows.append([rank, companyname, webpage, description, location, yearend, salesrise, sales, staff, comments])<\/span><span id=\"0bc0\" class=\"gc lh li jl np b do on oo op oq or nx l ny\" data-selectable-paragraph=\"\">print(rows)<\/span><\/pre>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Entri gives you the best Coding experience<\/a><\/strong><\/h4>\n<h2 id=\"484e\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Writing_to_an_output_file\"><\/span><strong>Writing to an output file<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"e7f5\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">You may want to save this data for analysis and this can be done very simply within python from our list.<\/p>\n<pre class=\"nq nr ns nt gz nu bt nv\"><span id=\"f206\" class=\"gc lh li jl np b do nw nx l ny\" data-selectable-paragraph=\"\"># 
Create csv and write rows to output file\r\nwith open('techtrack100.csv', 'w', newline='') as f_output:\r\n    csv_output = csv.writer(f_output)\r\n    csv_output.writerows(rows)<\/span><\/pre>\n<p id=\"92cf\" class=\"pw-post-body-paragraph ki kj jl kk b kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld le lf it gc\" data-selectable-paragraph=\"\">When you run the Python script, your output file will be generated containing 100 rows of results that you can examine in further detail.<\/p>\n<h4 style=\"text-align: center\"><strong><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\" target=\"_blank\" rel=\"noopener\">Grab the opportunity to learn Python with Entri! Click Here<\/a><\/strong><\/h4>\n<h2 id=\"60ef\" class=\"mc li jl bn lj md me mf ln mg mh mi lr mj mk ml lu mm mn mo lx mp mq mr ma ms gc\"><span class=\"ez-toc-section\" id=\"Summary\"><\/span><strong>Summary<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"10ca\" class=\"pw-post-body-paragraph ki kj jl kk b kl mt kn ko kp mu kr ks kt mv kv kw kx mw kz la lb mx ld le lf it gc\" data-selectable-paragraph=\"\">This brief tutorial on web scraping with Python has outlined:<\/p>\n<ul class=\"\">\n<li id=\"7a7c\" class=\"my mz jl kk b kl km kp kq kt na kx nb lb nc lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Connecting to a webpage<\/li>\n<li id=\"3cdb\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Parsing HTML using BeautifulSoup<\/li>\n<li id=\"4479\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Looping through the soup object to find elements<\/li>\n<li id=\"34a4\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Performing some simple data cleaning<\/li>\n<li id=\"f534\" class=\"my mz jl kk b kl nh kp ni kt nj kx nk lb nl lf nd ne nf ng gc\" data-selectable-paragraph=\"\">Writing data to a 
csv<\/li>\n<\/ul>\n<p><a href=\"https:\/\/entri.sng.link\/Bcofz\/yeoy\/ojyv\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-25522670 size-full\" src=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1.png\" alt=\"Python and Machine Learning Rectangle\" width=\"970\" height=\"250\" srcset=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1.png 970w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1-300x77.png 300w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1-768x198.png 768w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1-750x193.png 750w\" sizes=\"auto, (max-width: 970px) 100vw, 970px\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web Scraping is gathering data from websites using code, and is one of the most logical and easily accessible sources of data. In this\u00a0 blog , we will go through a simple example of how to scrape a website to gather data on the top 100 companies. 
Automating this process with a web scraper avoids [&hellip;]<\/p>\n","protected":false},"author":111,"featured_media":25529000,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[802,1888],"tags":[],"class_list":["post-25528708","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-articles","category-python-programming"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Web Scraping Using Python: Data Science Skills - Entri Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Web Scraping Using Python: Data Science Skills - Entri Blog\" \/>\n<meta property=\"og:description\" content=\"Web Scraping is gathering data from websites using code, and is one of the most logical and easily accessible sources of data. In this\u00a0 blog , we will go through a simple example of how to scrape a website to gather data on the top 100 companies. 
Automating this process with a web scraper avoids [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\" \/>\n<meta property=\"og:site_name\" content=\"Entri Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/entri.me\/\" \/>\n<meta property=\"article:published_time\" content=\"2022-06-17T11:11:22+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png\" \/>\n\t<meta property=\"og:image:width\" content=\"820\" \/>\n\t<meta property=\"og:image:height\" content=\"615\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Feeba Mahin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@entri_app\" \/>\n<meta name=\"twitter:site\" content=\"@entri_app\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Feeba Mahin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\"},\"author\":{\"name\":\"Feeba Mahin\",\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36\"},\"headline\":\"Web Scraping Using Python: Data Science Skills\",\"datePublished\":\"2022-06-17T11:11:22+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\"},\"wordCount\":1514,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/entri.app\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png\",\"articleSection\":[\"Articles\",\"Python Programming\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\",\"url\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\",\"name\":\"Web Scraping Using Python: Data Science Skills - Entri 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/entri.app\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png\",\"datePublished\":\"2022-06-17T11:11:22+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage\",\"url\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png\",\"contentUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png\",\"width\":820,\"height\":615,\"caption\":\"Web Scraping Using Python Data Science Skills\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/entri.app\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Python Programming\",\"item\":\"https:\/\/entri.app\/blog\/category\/python-programming\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Web Scraping Using Python: Data Science Skills\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/entri.app\/blog\/#website\",\"url\":\"https:\/\/entri.app\/blog\/\",\"name\":\"Entri 
Blog\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/entri.app\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/entri.app\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/entri.app\/blog\/#organization\",\"name\":\"Entri App\",\"url\":\"https:\/\/entri.app\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png\",\"contentUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png\",\"width\":989,\"height\":446,\"caption\":\"Entri App\"},\"image\":{\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/entri.me\/\",\"https:\/\/x.com\/entri_app\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36\",\"name\":\"Feeba Mahin\",\"url\":\"https:\/\/entri.app\/blog\/author\/feeba123\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Web Scraping Using Python: Data Science Skills - Entri Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/","og_locale":"en_US","og_type":"article","og_title":"Web Scraping Using Python: Data Science Skills - Entri Blog","og_description":"Web Scraping is gathering data from websites using code, and is one of the most logical and easily accessible sources of data. 
In this\u00a0 blog , we will go through a simple example of how to scrape a website to gather data on the top 100 companies. Automating this process with a web scraper avoids [&hellip;]","og_url":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/","og_site_name":"Entri Blog","article_publisher":"https:\/\/www.facebook.com\/entri.me\/","article_published_time":"2022-06-17T11:11:22+00:00","og_image":[{"width":820,"height":615,"url":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png","type":"image\/png"}],"author":"Feeba Mahin","twitter_card":"summary_large_image","twitter_creator":"@entri_app","twitter_site":"@entri_app","twitter_misc":{"Written by":"Feeba Mahin","Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#article","isPartOf":{"@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/"},"author":{"name":"Feeba Mahin","@id":"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36"},"headline":"Web Scraping Using Python: Data Science Skills","datePublished":"2022-06-17T11:11:22+00:00","mainEntityOfPage":{"@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/"},"wordCount":1514,"commentCount":0,"publisher":{"@id":"https:\/\/entri.app\/blog\/#organization"},"image":{"@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage"},"thumbnailUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png","articleSection":["Articles","Python 
Programming"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/","url":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/","name":"Web Scraping Using Python: Data Science Skills - Entri Blog","isPartOf":{"@id":"https:\/\/entri.app\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage"},"image":{"@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage"},"thumbnailUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png","datePublished":"2022-06-17T11:11:22+00:00","breadcrumb":{"@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#primaryimage","url":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png","contentUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/06\/Web-Scraping-Using-Python-Data-Science-Skills.png","width":820,"height":615,"caption":"Web Scraping Using Python Data Science Skills"},{"@type":"BreadcrumbList","@id":"https:\/\/entri.app\/blog\/web-scraping-using-python-data-science-skills\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/entri.app\/blog\/"},{"@type":"ListItem","position":2,"name":"Python 
Programming","item":"https:\/\/entri.app\/blog\/category\/python-programming\/"},{"@type":"ListItem","position":3,"name":"Web Scraping Using Python: Data Science Skills"}]},{"@type":"WebSite","@id":"https:\/\/entri.app\/blog\/#website","url":"https:\/\/entri.app\/blog\/","name":"Entri Blog","description":"","publisher":{"@id":"https:\/\/entri.app\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/entri.app\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/entri.app\/blog\/#organization","name":"Entri App","url":"https:\/\/entri.app\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png","contentUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png","width":989,"height":446,"caption":"Entri App"},"image":{"@id":"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/entri.me\/","https:\/\/x.com\/entri_app"]},{"@type":"Person","@id":"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36","name":"Feeba 
Mahin","url":"https:\/\/entri.app\/blog\/author\/feeba123\/"}]}},"_links":{"self":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts\/25528708","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/users\/111"}],"replies":[{"embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/comments?post=25528708"}],"version-history":[{"count":5,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts\/25528708\/revisions"}],"predecessor-version":[{"id":25529002,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts\/25528708\/revisions\/25529002"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/media\/25529000"}],"wp:attachment":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/media?parent=25528708"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/categories?post=25528708"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/tags?post=25528708"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}