Okay, so today I’m gonna walk you through what I did with “joe diaz news”. I was messing around with some data stuff, trying to see if I could pull some news snippets related to this Joe Diaz character. Sounds simple, right? Well, kinda…

First thing I did, naturally, was to hit up Google. Just a basic search for “joe diaz news” to see what’s out there. I wanted to get a feel for the kind of info that’s available and where it’s coming from. Scraped a few pages manually, just copy-pasting stuff into a text file. Real caveman stuff, I know.
Then, I thought, “This is dumb, I need something more automated.” So, I started looking into web scraping libraries. I ended up settling on Beautiful Soup in Python. It’s pretty easy to use, and there are tons of tutorials online. Installed it using pip: `pip install beautifulsoup4`. Boom, done.
Next, I needed to figure out how to actually use Beautiful Soup. So, I picked a specific news site that seemed to have a bunch of articles mentioning Joe Diaz. I inspected the page source using Chrome’s developer tools (right-click, “Inspect”). This is crucial. You gotta find the HTML tags that contain the news headlines and summaries.
This part took a while. I was digging through all sorts of `div`s, `span`s, and `a` tags, trying to find a consistent pattern. Eventually, I found that the headlines were all inside `<h3>` tags with a specific class name (let’s say it was “article-title”, but don’t quote me on that ’cause I don’t remember exactly). The summaries were in `<p>` tags right after the headlines.
Okay, code time. I wrote a quick Python script (rough sketch after this list) to:

- Fetch the HTML of the news site using the `requests` library (`pip install requests` if you don’t have it).
- Parse the HTML with Beautiful Soup.
- Find all the `<h3>` tags with the “article-title” class.
- For each headline, grab the text and then find the next `<p>` tag to get the summary.
- Print the headline and summary to the console.
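Here’s roughly what that first script looked like. Treat it as a minimal sketch: the URL is a placeholder, and “article-title” is just my best guess at the class name, so swap in whatever you actually find when you inspect the page.

```python
# Rough sketch of the scraper. The URL and the "article-title" class are
# placeholders -- use the site and class name you actually find when
# inspecting the page source.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/news"  # hypothetical news site

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Grab every <h3 class="article-title"> and the <p> that follows it.
for headline in soup.find_all("h3", class_="article-title"):
    title = headline.get_text(strip=True)
    summary_tag = headline.find_next("p")
    summary = summary_tag.get_text(strip=True) if summary_tag else ""
    print(title)
    print(summary)
    print("-" * 40)
```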
My first run was a disaster. It was spitting out all sorts of random stuff, not just the headlines and summaries. Turns out, I was being too broad with my tag selection. I needed to be more specific in my Beautiful Soup queries. I tweaked the script a few times, adding more filters and conditions until I was getting mostly clean results.
The output was still kinda messy. There were extra spaces, weird characters, and some HTML tags that I missed. So, I added some string cleaning functions to the script. Used `strip()` to remove leading/trailing spaces and replaced some of the HTML entities with their actual characters (like “&amp;” with “&”).
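The cleanup helper looked something like this, as a rough sketch: `html.unescape` handles the entities, and a couple of regexes handle the stray tags and extra whitespace.

```python
# Minimal cleanup helper -- a sketch of the kind of string cleaning I ended
# up doing. html.unescape turns entities like &amp; back into &; the regexes
# strip leftover tags and collapse runs of whitespace.
import html
import re

def clean_snippet(text: str) -> str:
    text = html.unescape(text)           # &amp; -> &, &quot; -> ", etc.
    text = re.sub(r"<[^>]+>", "", text)  # drop any stray HTML tags I missed
    text = re.sub(r"\s+", " ", text)     # collapse extra whitespace
    return text.strip()
```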
Now, the news snippets were looking pretty good. But I didn’t want to just print them to the console. I wanted to save them to a file. So, I modified the script to write the headlines and summaries to a text file, one article per line. I also added a timestamp to each entry, so I knew when I scraped it.
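The file-writing step was something like the sketch below. The filename and the tab-separated layout are just what I happened to pick, nothing special.

```python
# Sketch of the save step: one article per line, prefixed with a timestamp
# so I know when it was scraped. The filename is arbitrary.
from datetime import datetime

def save_snippet(title: str, summary: str, path: str = "joe_diaz_news.txt") -> None:
    timestamp = datetime.now().isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{timestamp}\t{title}\t{summary}\n")
```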
Finally, I decided to try a different approach. Instead of scraping a single news site, I wanted to use a news API. I found a free one (I can’t remember the exact one right now, but there are tons out there) that allowed me to search for news articles based on keywords. Signed up for an API key and then modified my script to use the API instead of scraping directly.
This was way easier! The API returned the results in JSON format, which is super easy to parse in Python. I just used the `json` library to load the data and then extracted the headlines and summaries. The data was much cleaner too, so I didn’t have to do as much cleaning.
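Since I don’t remember which API it was, here’s a generic sketch of what that version looked like. The endpoint, the query parameters, and the “articles”/“title”/“description” keys are all hypothetical; every news API names these a bit differently, so check the docs for whichever one you sign up with.

```python
# Generic sketch of the API version. The endpoint, parameter names, and
# response keys below are hypothetical -- adjust them to match the actual
# API you use.
import json
import requests

API_KEY = "YOUR_API_KEY"                            # from the API signup
API_URL = "https://api.example-news.com/v1/search"  # placeholder endpoint

params = {"q": "joe diaz", "apiKey": API_KEY}

response = requests.get(API_URL, params=params, timeout=10)
response.raise_for_status()

data = json.loads(response.text)  # or just response.json()

# The "articles", "title", and "description" keys are assumptions about the
# response shape -- swap in whatever the real API returns.
for article in data.get("articles", []):
    print(article.get("title", ""))
    print(article.get("description", ""))
    print("-" * 40)
```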
In the end, I had two scripts: one for scraping a specific news site and one for using a news API. The API one was definitely the winner in terms of ease of use and data quality. But the scraping script was still useful for getting data from sites that don’t have APIs. It was a good learning experience overall, even though it was a bit of a pain in the butt at times. Now I have tons of “joe diaz news” that I don’t really know what to do with. Maybe some kind of sentiment analysis next?
Lessons Learned:
- Web scraping can be tricky, but Beautiful Soup makes it a lot easier.
- Inspecting the page source is crucial for finding the right HTML tags.
- News APIs are your friend if you can find a good one.
- Always clean your data!