{"id":25523403,"date":"2022-05-08T04:42:57","date_gmt":"2022-05-07T23:12:57","guid":{"rendered":"https:\/\/entri.app\/blog\/?p=25523403"},"modified":"2023-11-22T17:33:42","modified_gmt":"2023-11-22T12:03:42","slug":"data-science-skills-web-scraping-using-python","status":"publish","type":"post","link":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/","title":{"rendered":"Data Science Skills: Web Scraping Using Python"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_79_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-69e1137fdf3fd\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-69e1137fdf3fd\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-1'><a class=\"ez-toc-link ez-toc-heading-1\" 
href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#One_of_the_first_tasks_that_I_was_given_in_my_job_as_a_Data_Scientist_involved_Web_Scraping_This_was_a_completely_unfamiliar_concept_to_me_at_the_time_gathering_data_from_websites_using_code_but_it_is_one_of_the_most_logical_and_easily_accessible_sources_of_data\" >One of the first tasks that I was given in my job as a Data Scientist involved Web Scraping. This was a completely unfamiliar concept to me at the time, gathering data from websites using code, but it is one of the most logical and easily accessible sources of data.<\/a><ul class='ez-toc-list-level-2' ><li class='ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Getting_Started\" >Getting Started<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Inspect_the_webpage\" >Inspect the webpage<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Parse_the_webpage_html_using_Beautiful_Soup\" >Parse the webpage html using Beautiful Soup<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#unlock_full_stack_developer_skills_enroll_for_free_demo_video\" >unlock full stack developer skills ! 
enroll for free demo video !!<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Search_for_html_elements\" >Search for html elements<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Looping_through_elements_and_saving_variables\" >Looping through elements and saving variables<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Crack_the_code_learn_python_in_malayalam\" >Crack the code : learn python in malayalam !<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#Data_Cleaning\" >Data Cleaning<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<div class=\"\">\n<h1 id=\"f3e6\" class=\"pw-post-title jr js jt bn ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp gc\" data-selectable-paragraph=\"\"><span class=\"ez-toc-section\" id=\"One_of_the_first_tasks_that_I_was_given_in_my_job_as_a_Data_Scientist_involved_Web_Scraping_This_was_a_completely_unfamiliar_concept_to_me_at_the_time_gathering_data_from_websites_using_code_but_it_is_one_of_the_most_logical_and_easily_accessible_sources_of_data\"><\/span><span style=\"color: #333333; font-size: 15px;\">One of the first tasks that I was given in my job as a Data Scientist involved Web Scraping. 
This was a completely unfamiliar concept to me at the time, gathering data from websites using code, but it is one of the most logical and easily accessible sources of data.<\/span><span class=\"ez-toc-section-end\"><\/span><\/h1>\n<\/div>\n<p data-selectable-paragraph=\"\"><a href=\"https:\/\/entri.app\/course\/python-programming-course\/\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-25522670 size-full\" src=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1.png\" alt=\"Python and Machine Learning Rectangle\" width=\"970\" height=\"250\" srcset=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1.png 970w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1-300x77.png 300w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1-768x198.png 768w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Python-and-Machine-Learning-Rectangle-1-750x193.png 750w\" sizes=\"auto, (max-width: 970px) 100vw, 970px\" \/><\/a><\/p>\n<h2 id=\"4a14\" class=\"mk lq jt bn lr ml mm mn lv mo mp mq lz mr ms mt mc mu mv mw mf mx my mz mi na gc\"><span class=\"ez-toc-section\" id=\"Getting_Started\"><\/span><strong>Getting Started<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"d6ca\" class=\"pw-post-body-paragraph kq kr jt ks b kt nb kv kw kx nc kz la lb nd ld le lf ne lh li lj nf ll lm ln jb gc\" data-selectable-paragraph=\"\">The first question to ask before getting started with any python application is \u2018Which libraries do I need?\u2019<\/p>\n<p id=\"4429\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">For web scraping there are a few different libraries to consider, including:<\/p>\n<ul class=\"\">\n<li id=\"505d\" class=\"ng nh jt ks b kt ku kx ky 
lb ni lf nj lj nk ln nl nm nn no gc\" data-selectable-paragraph=\"\">Beautiful Soup<\/li>\n<li data-selectable-paragraph=\"\">Requests<\/li>\n<li id=\"5592\" class=\"ng nh jt ks b kt np kx nq lb nr lf ns lj nt ln nl nm nn no gc\" data-selectable-paragraph=\"\">Scrapy<\/li>\n<li id=\"b00f\" class=\"ng nh jt ks b kt np kx nq lb nr lf ns lj nt ln nl nm nn no gc\" data-selectable-paragraph=\"\">Selenium<\/li>\n<\/ul>\n<p id=\"e97b\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">In this example we will be using Beautiful Soup. Using\u00a0<code class=\"fr nu nv nw nx b\">pip<\/code>, the Python package manager, you can install Beautiful Soup with the following:<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"b531\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">pip install BeautifulSoup4<\/span><\/pre>\n<p id=\"2681\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">With Beautiful Soup installed, let\u2019s get started!<\/p>\n<h2 id=\"06ca\" class=\"mk lq jt bn lr ml mm mn lv mo mp mq lz mr ms mt mc mu mv mw mf mx my mz mi na gc\"><span class=\"ez-toc-section\" id=\"Inspect_the_webpage\"><\/span><strong>Inspect the webpage<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"3a7d\" class=\"pw-post-body-paragraph kq kr jt ks b kt nb kv kw kx nc kz la lb nd ld le lf ne lh li lj nf ll lm ln jb gc\" data-selectable-paragraph=\"\">To know which elements you need to target in your Python code, you first need to inspect the web page.<\/p>\n<p id=\"1ffa\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">To gather data, you can inspect the page by right-clicking on the element of interest and selecting \u2018Inspect\u2019. 
This brings up the HTML code, where we can see the element that each field is contained within.<\/p>\n<p id=\"f475\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">Since the data is stored in a table, it is straightforward to scrape with just a few lines of code. If you want to familiarise yourself with scraping websites, bear in mind that it will not always be so simple!<\/p>\n<h2 id=\"422c\" class=\"mk lq jt bn lr ml mm mn lv mo mp mq lz mr ms mt mc mu mv mw mf mx my mz mi na gc\"><span class=\"ez-toc-section\" id=\"Parse_the_webpage_html_using_Beautiful_Soup\"><\/span><strong>Parse the webpage html using Beautiful Soup<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"833c\" class=\"pw-post-body-paragraph kq kr jt ks b kt nb kv kw kx nc kz la lb nd ld le lf ne lh li lj nf ll lm ln jb gc\" data-selectable-paragraph=\"\">Now that you have looked at the structure of the HTML and familiarised yourself with what you are scraping, it\u2019s time to get started with Python!<\/p>\n<p id=\"74de\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">The first step is to import the libraries that you will be using for your web scraper. We have already talked about Beautiful Soup above, which helps us to handle the HTML. The next library we are importing is\u00a0<code class=\"fr nu nv nw nx b\">urllib<\/code>, which makes the connection to the webpage. 
Finally, we will be writing the output to a csv so we also need to import the\u00a0<code class=\"fr nu nv nw nx b\">csv<\/code>\u00a0library. As an alternative, the\u00a0<code class=\"fr nu nv nw nx b\">json<\/code>\u00a0library could be used here instead.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"bb4b\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\"># import libraries\r\nfrom bs4 import BeautifulSoup\r\nimport urllib.request\r\nimport csv<\/span><\/pre>\n<p id=\"20a0\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">The next step is to define the url that you are scraping. As discussed in the previous section, this webpage presents all results on one page so the full url as in the address bar is given here.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"5ee0\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\"># specify the url\r\nurlpage =  'http:\/\/www.fasttrack.co.uk\/league-tables\/tech-track-100\/league-table\/'<\/span><\/pre>\n<p id=\"0394\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">We then make the connection to the webpage and we can parse the html using BeautifulSoup, storing the object in the variable \u2018soup\u2019.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"a3e8\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\"># query the website and return the html to the variable 'page'\r\npage = urllib.request.urlopen(urlpage)\r\n# parse the html using beautiful soup and store in variable 'soup'\r\nsoup = BeautifulSoup(page, 'html.parser')<\/span><\/pre>\n<p id=\"9208\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">We can print the soup variable at this stage 
which should return the full parsed html of the webpage we have requested.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"c527\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">print(soup)<\/span><\/pre>\n<p id=\"e72a\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">If there is an error or the variable is empty, then the request may not have been successful. You may wish to implement error handling at this point using the\u00a0<code class=\"fr nu nv nw nx b\">urllib.error<\/code>\u00a0module.<\/p>\n<h2 style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"unlock_full_stack_developer_skills_enroll_for_free_demo_video\"><\/span><a class=\"btn btn-default\" href=\"https:\/\/entri.app\/course\/full-stack-developer-course\/\">unlock full stack developer skills ! enroll for free demo video !!<\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h2 id=\"4ab3\" class=\"mk lq jt bn lr ml mm mn lv mo mp mq lz mr ms mt mc mu mv mw mf mx my mz mi na gc\"><span class=\"ez-toc-section\" id=\"Search_for_html_elements\"><\/span><strong>Search for html elements<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"5150\" class=\"pw-post-body-paragraph kq kr jt ks b kt nb kv kw kx nc kz la lb nd ld le lf ne lh li lj nf ll lm ln jb gc\" data-selectable-paragraph=\"\">As all of the results are contained within a table, we can search the soup object for the table using the\u00a0<code class=\"fr nu nv nw nx b\">find<\/code>\u00a0method. 
We can then find each row within the table using the\u00a0<code class=\"fr nu nv nw nx b\">find_all<\/code>\u00a0method.<\/p>\n<p id=\"0385\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">If we print the number of rows we should get a result of 101, the 100 rows plus the header.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"fb05\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\"># find results within table\r\ntable = soup.find('table', attrs={'class': 'tableSorter'})\r\nresults = table.find_all('tr')\r\nprint('Number of results', len(results))<\/span><\/pre>\n<p id=\"769a\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">We can therefore loop over the results to gather the data.<\/p>\n<p id=\"5e68\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">Printing the first 2 rows in the soup object, we can see that the structure of each row is:<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"b20b\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">&lt;tr&gt;\r\n&lt;th&gt;Rank&lt;\/th&gt;\r\n&lt;th&gt;Company&lt;\/th&gt;\r\n&lt;th class=\"\"&gt;Location&lt;\/th&gt;\r\n&lt;th class=\"no-word-wrap\"&gt;Year end&lt;\/th&gt;\r\n&lt;th class=\"\" style=\"text-align:right;\"&gt;Annual sales rise over 3 years&lt;\/th&gt;\r\n&lt;th class=\"\" style=\"text-align:right;\"&gt;Latest sales \u00a3000s&lt;\/th&gt;\r\n&lt;th class=\"\" style=\"text-align:right;\"&gt;Staff&lt;\/th&gt;\r\n&lt;th class=\"\"&gt;Comment&lt;\/th&gt;\r\n&lt;!--                            &lt;th&gt;FYE&lt;\/th&gt;--&gt;\r\n&lt;\/tr&gt;\r\n&lt;tr&gt;\r\n&lt;td&gt;1&lt;\/td&gt;\r\n&lt;td&gt;&lt;a 
href=\"http:\/\/www.fasttrack.co.uk\/company_profile\/wonderbly-3\/\"&gt;&lt;span class=\"company-name\"&gt;Wonderbly&lt;\/span&gt;&lt;\/a&gt;Personalised children's books&lt;\/td&gt;\r\n&lt;td&gt;East London&lt;\/td&gt;\r\n&lt;td&gt;Apr-17&lt;\/td&gt;\r\n&lt;td style=\"text-align:right;\"&gt;294.27%&lt;\/td&gt;\r\n&lt;td style=\"text-align:right;\"&gt;*25,860&lt;\/td&gt;\r\n&lt;td style=\"text-align:right;\"&gt;80&lt;\/td&gt;\r\n&lt;td&gt;Has sold nearly 3m customisable children\u2019s books in 200 countries&lt;\/td&gt;\r\n&lt;!--                                            &lt;td&gt;Apr-17&lt;\/td&gt;--&gt;\r\n&lt;\/tr&gt;<\/span><\/pre>\n<p id=\"115f\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">There are 8 columns in the table containing: Rank, Company, Location, Year End, Annual Sales Rise, Latest Sales, Staff and Comments, all of which are interesting data that we can save.<\/p>\n<p id=\"18a5\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">This structure is consistent throughout all rows on the webpage (which may not always be the case for all websites!), and therefore we can again use the\u00a0<code class=\"fr nu nv nw nx b\">find_all<\/code>\u00a0method to assign each column to a variable that we can write to a csv or JSON by searching for the\u00a0<code class=\"fr nu nv nw nx b\">&lt;td&gt;<\/code>\u00a0element.<\/p>\n<table>\n<tbody>\n<tr>\n<td colspan=\"3\">\n<h5 style=\"text-align: center;\"><strong>Are you aspiring for a booming career in IT? 
If YES, then dive in<\/strong><\/h5>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<h5><a href=\"https:\/\/entri.app\/course\/full-stack-developer-course\/\"><strong>Full Stack Developer Course<\/strong><\/a><\/h5>\n<\/td>\n<td>\n<h5><a href=\"https:\/\/entri.app\/course\/python-programming-course\/\"><strong>Python Programming Course<\/strong><\/a><\/h5>\n<\/td>\n<td>\n<h5><a href=\"https:\/\/entri.app\/course\/data-science-and-machine-learning-course\/\"><strong>Data Science and Machine Learning Course<\/strong><\/a><\/h5>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2 id=\"28e8\" class=\"mk lq jt bn lr ml mm mn lv mo mp mq lz mr ms mt mc mu mv mw mf mx my mz mi na gc\"><span class=\"ez-toc-section\" id=\"Looping_through_elements_and_saving_variables\"><\/span><strong>Looping through elements and saving variables<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"a467\" class=\"pw-post-body-paragraph kq kr jt ks b kt nb kv kw kx nc kz la lb nd ld le lf ne lh li lj nf ll lm ln jb gc\" data-selectable-paragraph=\"\">In python, it is useful to append the results to a list to then write the data to a file. 
We should declare the list and add the CSV headers to it before the loop with the following:<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"3139\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\"># create and write headers to a list \r\nrows = []\r\nrows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales \u00a3000s', 'Staff', 'Comments'])\r\nprint(rows)<\/span><\/pre>\n<p id=\"f51b\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">This will print out the first row that we have added to the list, containing the headers.<\/p>\n<p id=\"8723\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">You might notice that there are two extra fields,\u00a0<code class=\"fr nu nv nw nx b\">Webpage<\/code>\u00a0and\u00a0<code class=\"fr nu nv nw nx b\">Description<\/code>, which are not column names in the table. If you take a closer look at the HTML of the rows we printed above, the second column contains more than just the company name. 
We can use some further extraction to get this extra information.<\/p>\n<p id=\"3b27\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">The next step is to loop over the results, process the data and append to\u00a0<code class=\"fr nu nv nw nx b\">rows\u00a0<\/code>which can be written to a csv.<\/p>\n<p id=\"ff5f\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">To find the results in the loop:<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"11b1\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\"># loop over results\r\nfor result in results:\r\n    # find all columns per result\r\n    data = result.find_all('td')\r\n    # check that columns have data \r\n    if len(data) == 0: \r\n        continue<\/span><\/pre>\n<p id=\"0a26\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">Since the first row in the table contains only the headers, we can skip this result, as shown above. It also does not contain any\u00a0<code class=\"fr nu nv nw nx b\">&lt;td&gt;<\/code>\u00a0elements so when searching for the element, nothing is returned. 
We can then check that only results containing data are processed by requiring the length of the data to be non-zero.<\/p>\n<p id=\"f470\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">We can then start to process the data and save to variables.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"7991\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">    # write columns to variables\r\n    rank = data[0].getText()\r\n    company = data[1].getText()\r\n    location = data[2].getText()\r\n    yearend = data[3].getText()\r\n    salesrise = data[4].getText()\r\n    sales = data[5].getText()\r\n    staff = data[6].getText()\r\n    comments = data[7].getText()<\/span><\/pre>\n<p id=\"6db6\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">The above simply gets the text from each of the columns and saves to variables. 
Some of this data however needs further cleaning to remove unwanted characters or extract further information.<\/p>\n<h2 style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Crack_the_code_learn_python_in_malayalam\"><\/span><a class=\"btn btn-default\" href=\"https:\/\/entri.app\/course\/python-programming-course\/\">Crack the code : learn python in malayalam !<\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h2 id=\"7318\" class=\"mk lq jt bn lr ml mm mn lv mo mp mq lz mr ms mt mc mu mv mw mf mx my mz mi na gc\"><span class=\"ez-toc-section\" id=\"Data_Cleaning\"><\/span><strong>Data Cleaning<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p id=\"dccd\" class=\"pw-post-body-paragraph kq kr jt ks b kt nb kv kw kx nc kz la lb nd ld le lf ne lh li lj nf ll lm ln jb gc\" data-selectable-paragraph=\"\">If we print out the variable\u00a0<code class=\"fr nu nv nw nx b\">company<\/code>, the text not only contains the name of the company but also a description. If we then print out\u00a0<code class=\"fr nu nv nw nx b\">sales<\/code>, it contains unwanted characters such as footnote symbols that would be useful to remove.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"4c19\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">    print('Company is', company)\r\n    # Company is WonderblyPersonalised children's books          \r\n    print('Sales', sales)\r\n    # Sales *25,860<\/span><\/pre>\n<p id=\"5083\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">We would like to split\u00a0<code class=\"fr nu nv nw nx b\">company<\/code>\u00a0into the company name and the description which we can do in a few lines of code. Looking again at the html, for this column there is a\u00a0<code class=\"fr nu nv nw nx b\">&lt;span&gt;<\/code>\u00a0element that contains only the company name. 
There is also a link in this column to another page on the website that has more detailed information about the company. We will be using this a little later!<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"8fc7\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">&lt;td&gt;&lt;a href=\"http:\/\/www.fasttrack.co.uk\/company_profile\/wonderbly-3\/\"&gt;&lt;span class=\"company-name\"&gt;Wonderbly&lt;\/span&gt;&lt;\/a&gt;Personalised children's books&lt;\/td&gt;<\/span><\/pre>\n<p id=\"729c\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">To separate\u00a0<code class=\"fr nu nv nw nx b\">company<\/code>\u00a0into two fields, we can use the\u00a0<code class=\"fr nu nv nw nx b\">find<\/code>\u00a0method to get the\u00a0<code class=\"fr nu nv nw nx b\">&lt;span&gt;<\/code>\u00a0element and then use\u00a0<code class=\"fr nu nv nw nx b\">replace<\/code>\u00a0to remove the company name from the\u00a0<code class=\"fr nu nv nw nx b\">company<\/code>\u00a0variable, so that it leaves only the description.<br \/>\nTo remove the unwanted characters from\u00a0<code class=\"fr nu nv nw nx b\">sales<\/code>, we can use the\u00a0<code class=\"fr nu nv nw nx b\">strip<\/code>\u00a0and\u00a0<code class=\"fr nu nv nw nx b\">replace<\/code>\u00a0methods.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"0f3f\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">    # extract description from the name\r\n    companyname = data[1].find('span', attrs={'class':'company-name'}).getText()    \r\n    description = company.replace(companyname, '')\r\n    \r\n    # remove unwanted characters\r\n    sales = sales.strip('*').strip('\u2020').replace(',','')<\/span><\/pre>\n<p id=\"9314\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll 
lm ln jb gc\" data-selectable-paragraph=\"\">The last variable we would like to save is the company website. As discussed above, the second column contains a link to another page that has an overview of each company. Each company page has its own table, which most of the time contains the company website.<\/p>\n<p id=\"c379\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">To scrape the URL from each table and save it as a variable, we need to use the same steps as above:<\/p>\n<ul class=\"\">\n<li id=\"f186\" class=\"ng nh jt ks b kt ku kx ky lb ni lf nj lj nk ln nl nm nn no gc\" data-selectable-paragraph=\"\">Find the element that has the URL of the company page on the Fast Track website<\/li>\n<li id=\"9036\" class=\"ng nh jt ks b kt np kx nq lb nr lf ns lj nt ln nl nm nn no gc\" data-selectable-paragraph=\"\">Make a request to each company page URL<\/li>\n<li id=\"5f7a\" class=\"ng nh jt ks b kt np kx nq lb nr lf ns lj nt ln nl nm nn no gc\" data-selectable-paragraph=\"\">Parse the HTML using Beautiful Soup<\/li>\n<li id=\"3bba\" class=\"ng nh jt ks b kt np kx nq lb nr lf ns lj nt ln nl nm nn no gc\" data-selectable-paragraph=\"\">Find the elements of interest<\/li>\n<\/ul>\n<p id=\"f4ed\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">Looking at a few of the company pages, the URL is in the last row of the table, so we can search within the last row for the\u00a0<code class=\"fr nu nv nw nx b\">&lt;a&gt;<\/code>\u00a0element.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"4449\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">    # go to link and extract company website\r\n    url = data[1].find('a').get('href')\r\n    page = urllib.request.urlopen(url)\r\n    # parse the html \r\n    soup = 
BeautifulSoup(page, 'html.parser')\r\n    # find the last result in the table and get the link\r\n    try:\r\n        tableRow = soup.find('table').find_all('tr')[-1]\r\n        webpage = tableRow.find('a').get('href')\r\n    except (AttributeError, IndexError):\r\n        # no table, no rows, or no link in the last row\r\n        webpage = None<\/span><\/pre>\n<p id=\"7d00\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">There may also be cases where the company website is not displayed, so we wrap the lookup in a\u00a0<code class=\"fr nu nv nw nx b\">try<\/code>\u00a0<code class=\"fr nu nv nw nx b\">except<\/code>\u00a0block that sets the value to None when a URL is not found.<\/p>\n<p id=\"933c\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">Once we have saved all of the data to variables, still within the loop, we can add each result to the list\u00a0<code class=\"fr nu nv nw nx b\">rows<\/code>.<\/p>\n<pre class=\"ny nz oa ob gz oc bt od\"><span id=\"aba3\" class=\"gc lp lq jt nx b do oe of l og\" data-selectable-paragraph=\"\">    # write each result to rows\r\n    rows.append([rank, companyname, webpage, description, location, yearend, salesrise, sales, staff, comments])<\/span><span id=\"0bc0\" class=\"gc lp lq jt nx b do ov ow ox oy oz of l og\" data-selectable-paragraph=\"\">print(rows)<\/span><\/pre>\n<p id=\"58af\" class=\"pw-post-body-paragraph kq kr jt ks b kt ku kv kw kx ky kz la lb lc ld le lf lg lh li lj lk ll lm ln jb gc\" data-selectable-paragraph=\"\">It is then useful to print the variable outside of the loop, to check that it looks as you expect before writing it to a file!<\/p>\n<p data-selectable-paragraph=\"\"><a 
href=\"https:\/\/entri.app\/course\/full-stack-developer-course\/\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-25522667 size-full\" src=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Web-Development-Rectangle.png\" alt=\"Web Development Rectangle\" width=\"970\" height=\"250\" srcset=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Web-Development-Rectangle.png 970w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Web-Development-Rectangle-300x77.png 300w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Web-Development-Rectangle-768x198.png 768w, https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/04\/Web-Development-Rectangle-750x193.png 750w\" sizes=\"auto, (max-width: 970px) 100vw, 970px\" \/><\/a><\/p>\n<table>\n<tbody>\n<tr>\n<td style=\"text-align: center;\" colspan=\"3\"><strong>Our Other Courses<\/strong><\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/entri.app\/course\/mep-course\/\"><strong>MEP Course<\/strong><\/a><\/td>\n<td><a href=\"https:\/\/entri.app\/course\/quantity-surveying-course\/\"><strong>Quantity Surveying Course<\/strong><\/a><\/td>\n<td><a href=\"https:\/\/entri.app\/course\/montessori-teachers-training-course\/\"><strong>Montessori Teachers Training Course<\/strong><\/a><\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/entri.app\/course\/performance-marketing-course\/\"><strong>Performance Marketing Course\u00a0<\/strong><\/a><\/td>\n<td><a href=\"https:\/\/entri.app\/course\/practical-accounting-course\/\"><strong>Practical Accounting Course<\/strong><\/a><\/td>\n<td><a href=\"https:\/\/entri.app\/course\/yoga-teachers-training-course\/\"><strong>Yoga Teachers Training Course<\/strong><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n","protected":false},"excerpt":{"rendered":"<p>One of the first tasks that I was given in my job as a Data Scientist involved Web Scraping. 
This was a completely unfamiliar concept to me at the time, gathering data from websites using code, but it is one of the most logical and easily accessible sources of data. Getting Started The first question [&hellip;]<\/p>\n","protected":false},"author":111,"featured_media":25523491,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[802,1903,1888],"tags":[],"class_list":["post-25523403","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-articles","category-coding","category-python-programming"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Science Skills: Web Scraping Using Python - Entri Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Science Skills: Web Scraping Using Python - Entri Blog\" \/>\n<meta property=\"og:description\" content=\"One of the first tasks that I was given in my job as a Data Scientist involved Web Scraping. This was a completely unfamiliar concept to me at the time, gathering data from websites using code, but it is one of the most logical and easily accessible sources of data. 
Getting Started The first question [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\" \/>\n<meta property=\"og:site_name\" content=\"Entri Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/entri.me\/\" \/>\n<meta property=\"article:published_time\" content=\"2022-05-07T23:12:57+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-11-22T12:03:42+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png\" \/>\n\t<meta property=\"og:image:width\" content=\"820\" \/>\n\t<meta property=\"og:image:height\" content=\"615\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Feeba Mahin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@entri_app\" \/>\n<meta name=\"twitter:site\" content=\"@entri_app\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Feeba Mahin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\"},\"author\":{\"name\":\"Feeba Mahin\",\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36\"},\"headline\":\"Data Science Skills: Web Scraping Using Python\",\"datePublished\":\"2022-05-07T23:12:57+00:00\",\"dateModified\":\"2023-11-22T12:03:42+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\"},\"wordCount\":1333,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/entri.app\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png\",\"articleSection\":[\"Articles\",\"Coding\",\"Python Programming\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\",\"url\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\",\"name\":\"Data Science Skills: Web Scraping Using Python - Entri 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/entri.app\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png\",\"datePublished\":\"2022-05-07T23:12:57+00:00\",\"dateModified\":\"2023-11-22T12:03:42+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage\",\"url\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png\",\"contentUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png\",\"width\":820,\"height\":615,\"caption\":\"Data Science Skills , Web Scraping Using Python\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/entri.app\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Python Programming\",\"item\":\"https:\/\/entri.app\/blog\/category\/python-programming\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Data Science Skills: Web Scraping Using Python\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/entri.app\/blog\/#website\",\"url\":\"https:\/\/entri.app\/blog\/\",\"name\":\"Entri 
Blog\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/entri.app\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/entri.app\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/entri.app\/blog\/#organization\",\"name\":\"Entri App\",\"url\":\"https:\/\/entri.app\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png\",\"contentUrl\":\"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png\",\"width\":989,\"height\":446,\"caption\":\"Entri App\"},\"image\":{\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/entri.me\/\",\"https:\/\/x.com\/entri_app\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36\",\"name\":\"Feeba Mahin\",\"url\":\"https:\/\/entri.app\/blog\/author\/feeba123\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Data Science Skills: Web Scraping Using Python - Entri Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/","og_locale":"en_US","og_type":"article","og_title":"Data Science Skills: Web Scraping Using Python - Entri Blog","og_description":"One of the first tasks that I was given in my job as a Data Scientist involved Web Scraping. 
This was a completely unfamiliar concept to me at the time, gathering data from websites using code, but it is one of the most logical and easily accessible sources of data. Getting Started The first question [&hellip;]","og_url":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/","og_site_name":"Entri Blog","article_publisher":"https:\/\/www.facebook.com\/entri.me\/","article_published_time":"2022-05-07T23:12:57+00:00","article_modified_time":"2023-11-22T12:03:42+00:00","og_image":[{"width":820,"height":615,"url":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png","type":"image\/png"}],"author":"Feeba Mahin","twitter_card":"summary_large_image","twitter_creator":"@entri_app","twitter_site":"@entri_app","twitter_misc":{"Written by":"Feeba Mahin","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#article","isPartOf":{"@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/"},"author":{"name":"Feeba Mahin","@id":"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36"},"headline":"Data Science Skills: Web Scraping Using Python","datePublished":"2022-05-07T23:12:57+00:00","dateModified":"2023-11-22T12:03:42+00:00","mainEntityOfPage":{"@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/"},"wordCount":1333,"commentCount":0,"publisher":{"@id":"https:\/\/entri.app\/blog\/#organization"},"image":{"@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage"},"thumbnailUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png","articleSection":["Articles","Coding","Python 
Programming"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/","url":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/","name":"Data Science Skills: Web Scraping Using Python - Entri Blog","isPartOf":{"@id":"https:\/\/entri.app\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage"},"image":{"@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage"},"thumbnailUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png","datePublished":"2022-05-07T23:12:57+00:00","dateModified":"2023-11-22T12:03:42+00:00","breadcrumb":{"@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#primaryimage","url":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png","contentUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2022\/05\/Data-Science-Skills-Web-Scraping-Using-Python.png","width":820,"height":615,"caption":"Data Science Skills , Web Scraping Using Python"},{"@type":"BreadcrumbList","@id":"https:\/\/entri.app\/blog\/data-science-skills-web-scraping-using-python\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/entri.app\/blog\/"},{"@type":"ListItem","position":2,"name":"Python 
Programming","item":"https:\/\/entri.app\/blog\/category\/python-programming\/"},{"@type":"ListItem","position":3,"name":"Data Science Skills: Web Scraping Using Python"}]},{"@type":"WebSite","@id":"https:\/\/entri.app\/blog\/#website","url":"https:\/\/entri.app\/blog\/","name":"Entri Blog","description":"","publisher":{"@id":"https:\/\/entri.app\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/entri.app\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/entri.app\/blog\/#organization","name":"Entri App","url":"https:\/\/entri.app\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png","contentUrl":"https:\/\/entri.app\/blog\/wp-content\/uploads\/2019\/10\/Entri-Logo-1.png","width":989,"height":446,"caption":"Entri App"},"image":{"@id":"https:\/\/entri.app\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/entri.me\/","https:\/\/x.com\/entri_app"]},{"@type":"Person","@id":"https:\/\/entri.app\/blog\/#\/schema\/person\/f036dab84abae3dcc9390a1110d95d36","name":"Feeba 
Mahin","url":"https:\/\/entri.app\/blog\/author\/feeba123\/"}]}},"_links":{"self":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts\/25523403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/users\/111"}],"replies":[{"embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/comments?post=25523403"}],"version-history":[{"count":12,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts\/25523403\/revisions"}],"predecessor-version":[{"id":25568842,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/posts\/25523403\/revisions\/25568842"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/media\/25523491"}],"wp:attachment":[{"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/media?parent=25523403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/categories?post=25523403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entri.app\/blog\/wp-json\/wp\/v2\/tags?post=25523403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}