
This project can be broken down into three parts: gathering data, processing data, and analyzing data.
gathering data
I started by using SEM Rush’s Open.Trends service to find the top websites for each country across all industries. While this can be done manually, I automated the process using the Python libraries BeautifulSoup and Selenium-Python (you can also use the Requests library in this case, but I already had Selenium imported lol). Here’s some pseudo-code to give you an idea of how it was done:
import pandas
from bs4 import BeautifulSoup

# getCountries() and getTableData() are helper functions (not shown here)
# driver is an already-initialized Selenium webdriver

# run a function to get the list of countries Open.Trends has listed on their site
countries = getCountries()
# initialize a dictionary to store the information
d = {
    'country': [],
    'website': [],
    'visits': []
}
# iterate through that list
for country in countries:
    # follow semrush's URL formatting and plug in the country using a formatted string
    url = f'https://www.semrush.com/trending-websites/{country}/all'
    # navigate to the URL using Selenium Webdriver
    driver.get(url)
    # feed the page information into BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # extract the table data using BeautifulSoup
    results = getTableData(soup)
    # append the results to the dictionary (assuming each value in results is a list)
    # plain assignment here would overwrite the previous country's rows
    d['country'].extend(results['country'])
    d['website'].extend(results['website'])
    d['visits'].extend(results['visits'])
# save this into some sort of file
df = pandas.DataFrame(d)
df.to_csv('popular_websites.csv', index=False)

NOTE: The quality of this data is subject to the accuracy of SEM Rush’s methods. I didn’t really look too deeply into that because their listings were comparable to similar services.
You should now have a dictionary of the most popular websites in each country. A lot of those websites will be porn or malware or both. Let’s try to filter some of those out using the Cyren URL Lookup API. This is a service that uses “machine learning, heuristics, and human analysis” to categorize websites.
Here’s more pseudocode:
# iterate through all the websites we found
for i in range(len(df['website'])):
    # select the website
    url = df.loc[i, 'website']
    # call the API on the website (getCategory is a helper wrapping the API request, not shown)
    category = getCategory(url)
    # save the results
    df.loc[i, 'category'] = category
# filter out all the undesirable categories
undesirable = [...]
# keep only the rows whose category is NOT in that list
df = df.loc[~df['category'].isin(undesirable)]
# save this dataframe to avoid needing to do this all over again
df.to_csv('popular_websites_filtered.csv', index=False)
NOTE: Cyren URL Lookup API has 1,000 free queries per month per user.
COMPLETELY SEPARATE NOTE: You can use services like temp-mail to create temporary email addresses.
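With a monthly query limit, it may be worth caching every lookup on disk so a crashed or repeated run doesn't spend queries twice. Here's a minimal sketch of that idea. The `cached_lookup` name, the JSON cache file, and the `lookup` argument (any function that takes a URL and returns a category, such as the `getCategory` helper above) are all assumptions for illustration, not part of the original project.

```python
import json
from pathlib import Path


def cached_lookup(url, lookup, cache_path="category_cache.json"):
    """Return the category for `url`, calling `lookup` only on a cache miss."""
    path = Path(cache_path)
    # load whatever has been saved so far (empty cache on the first run)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if url not in cache:
        # this is the only place an API query is actually spent
        cache[url] = lookup(url)
        # write straight away so a crash later on doesn't lose the result
        path.write_text(json.dumps(cache, indent=2))
    return cache[url]
```

In the loop above, this would replace the direct call, e.g. `category = cached_lookup(url, getCategory)`. Re-reading the file on every call is slow-ish but fine at the scale of a thousand lookups.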
Now it’s time to get some screenshots of the websites! If you want to take full-page screenshots, you will need to use Selenium-Python’s Firefox webdriver. If not, any webdriver is fine. However, you probably don’t want to use full-page screenshots, as webpage sizes vary a lot and this can make your final results less interpretable.
from time import sleep

def acceptCookies(driver):
    # this function will probably consist of a bunch of try-except blocks
    # in search of a button that says accept/agree/allow cookies in every language
    # ngl i gave up like 1/3 of the way through
    pass

def notBot(driver):
    # some websites will present a captcha before giving you access
    # there are ways to beat that captcha
    # i didn't even try but you should
    pass

# iterate through websites
for i in range(len(df['website'])):
    url = df.loc[i, 'website']
    country = df.loc[i, 'country']
    driver.get(url)
    # wait for the page to load
    # you shouldn't really use static sleep calls but i did
    sleep(5)
    notBot(driver)
    sleep(2)
    acceptCookies(driver)
    sleep(2)
    # take screenshots
    driver.save_screenshot(f'homepage_{country.upper()}_{url}.png')
    # this call only exists for firefox webdrivers
    # use it instead of the call above for full-page shots (same filename, so it overwrites)
    driver.save_full_page_screenshot(f'homepage_{country.upper()}_{url}.png')

NOTE: When doing this, you can use a VPN to navigate to the appropriate country / region to increase the likelihood of seeing the local web page.
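One wrinkle with the filenames above: if the website column holds anything more than a bare domain (a scheme, a path, a trailing slash), dropping it straight into a filename can fail or create unexpected folders. Below is a small sketch of one way to clean that up. The `screenshot_name` helper is hypothetical, and it assumes you only care about the host and path part of the URL.

```python
import re
from urllib.parse import urlparse


def screenshot_name(country, url, prefix="homepage"):
    """Build a filesystem-safe screenshot filename from a country and a URL."""
    # urlparse only fills in the host when a scheme is present, so add one if missing
    parsed = urlparse(url if "://" in url else f"https://{url}")
    site = parsed.netloc + parsed.path
    # collapse anything that isn't a letter, digit, dot or hyphen into "_"
    site = re.sub(r"[^A-Za-z0-9.-]+", "_", site).strip("_")
    return f"{prefix}_{country.upper()}_{site}.png"


print(screenshot_name("jp", "https://www.yahoo.co.jp/"))
# homepage_JP_www.yahoo.co.jp.png
```

You could then call `driver.save_screenshot(screenshot_name(country, url))` inside the loop. Query strings are dropped on purpose here, since homepages rarely need them.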
processing data
I mostly followed this tutorial by Grigory Serebryakov on LearnOpenCV. It uses an implementation of a ResNet model to extract the features of an image. You can pull the code from his blog post, but we do need to load our images in differently. We can use this method by andrewjong (source). We need to save the image file paths for use in our final visualization.
from torchvision import datasets

class ImageFolderWithPaths(datasets.ImageFolder):
    """Custom dataset that includes image file paths. Extends
    torchvision.datasets.ImageFolder
    """
    # override the __getitem__ method. this is the method that dataloader calls
    def __getitem__(self, index):
        # this is what ImageFolder normally returns
        original_tuple = super(ImageFolderWithPaths, self).__getitem__(index)
        # the image file path
        path = self.imgs[index][0]
        # make a new tuple that includes original and the path
        tuple_with_path = (original_tuple + (path,))
        return tuple_with_path

Now we can load our images using that method.
import torch
from torchvision import transforms

# identify the path containing all your images
# if you want them to be labeled by country, you will need to sort them into folders
root_path = '...'
# transform the data so they are identical shapes
transform = transforms.Compose([transforms.Resize((255, 255)),
                                transforms.ToTensor()])
dataset = ImageFolderWithPaths(root_path, transform=transform)
# load the data
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

Next we initialize and run our model. I needed to adapt Serebryakov’s code slightly to account for how our images were loaded.
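import pickle
import numpy as np
import torch
from tqdm import tqdm

# use the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# initialize model (ResNet101 is the class defined in Serebryakov's tutorial)
model = ResNet101(pretrained=True)
model.eval()
model.to(device)
# initialize variables to store results
features = None
labels = []
image_paths = []
all_images = []
# run the model (no gradients needed for inference)
with torch.no_grad():
    for batch in tqdm(dataloader, desc='Running the model inference'):
        # the images must live on the same device as the model
        images = batch[0].to(device)
        all_images.append(batch[0].numpy())
        labels += batch[1].tolist()
        image_paths += batch[2]
        output = model.forward(images)
        # convert from tensor to numpy array
        current_features = output.cpu().numpy()
        if features is not None:
            features = np.concatenate((features, current_features))
        else:
            features = current_features
# convert labels back to their string names
labels = [dataset.classes[e] for e in labels]
# stack the per-batch arrays so every image is saved, not just the last batch
images = np.concatenate(all_images)
# save the data
np.save('images.npy', images)
np.save('features.npy', features)
with open('labels.pkl', 'wb') as f:
    pickle.dump(labels, f)
with open('image_paths.pkl', 'wb') as f:
    pickle.dump(image_paths, f)

We should now have 4 sets of data containing our image paths, labels, images, and their extracted features.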
analyzing data
We start by running our data through scikit-learn’s t-SNE implementation. This basically reduces our multidimensional feature arrays down to 2D co-ordinates that we can put on a graph. We can map smaller versions of our screenshots onto those coordinates to see how the machine has organized our websites.
import random
import cv2
import numpy as np
import torch
from sklearn.manifold import TSNE

# scale_to_01_range and compute_plot_coordinates are helpers from Serebryakov's tutorial

# the s in t-SNE stands for stochastic (random)
# let's set a seed for reproducible results
seed = 10
random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)
# run tsne
n_components = 2
tsne = TSNE(n_components=n_components)
tsne_result = tsne.fit_transform(features)
# scale and move the coordinates so they fit [0; 1] range
tx = scale_to_01_range(tsne_result[:, 0])
ty = scale_to_01_range(tsne_result[:, 1])
# start from a blank white canvas (the 1000 px size is an arbitrary choice)
plot_size = 1000
tsne_plot = 255 * np.ones((plot_size, plot_size, 3), np.uint8)
# plot the images (each one is re-read from its path, so we only need paths and coordinates)
for image_path, x, y in zip(image_paths, tx, ty):
    # read the image
    image = cv2.imread(image_path)
    # resize the image
    image = cv2.resize(image, (150, 100))
    # compute the dimensions of the image based on its tsne co-ordinates
    tl_x, tl_y, br_x, br_y = compute_plot_coordinates(image, x, y)
    # put the image to its t-SNE coordinates using numpy sub-array indices
    tsne_plot[tl_y:br_y, tl_x:br_x, :] = image
cv2.imshow('t-SNE', tsne_plot)
cv2.waitKey()

Now we can look for any visual patterns in the images. What I found was detailed in the sections above.
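The two helpers used in the plotting step come from the tutorial and aren't reproduced in this post. To give a rough idea of what they do, here is a stand-in sketch in plain Python. It is not a copy of the tutorial's code: `thumbnail_box` is a hypothetical name, it takes the thumbnail width and height directly instead of the image, and the 1000 px canvas is an assumption.

```python
def scale_to_01_range(values):
    """Min-max scale a 1-D sequence of numbers into the [0, 1] range."""
    values = [float(v) for v in values]
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # every point is identical, so put them all in the middle
        return [0.5] * len(values)
    return [(v - low) / span for v in values]


def thumbnail_box(width, height, x, y, plot_size=1000):
    """Pixel corners (tl_x, tl_y, br_x, br_y) for a thumbnail placed at (x, y).

    x and y are in [0, 1]; y is flipped because image rows grow downward.
    """
    # shrink the usable area so the whole thumbnail always stays on the canvas
    tl_x = int((plot_size - width) * x)
    tl_y = int((plot_size - height) * (1 - y))
    return tl_x, tl_y, tl_x + width, tl_y + height


print(scale_to_01_range([2, 4, 6]))
print(thumbnail_box(150, 100, 0.5, 0.5))
```

The first function also accepts a 1-D NumPy array, since it only needs to iterate over the values. The second shows why the coordinates get scaled to [0, 1] first: once they are in that range, placing a thumbnail is just a multiplication by the free space on the canvas.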
I wanted to understand this data through the lens of writing systems, culture (geographically and economically), and technology. So, I found datasets containing that information: writing systems, ISO countries with regional codes, and the global north-south divide. They needed to be supplemented with some additional Google searching to make sure we had labels for each country in our dataset.
Here’s a basic walkthrough of how I used this new analysis data.
import pandas as pd
import seaborn as sns

analysis_data = ...  # import data (a DataFrame with one row per country)
# initialize a list to capture a parallel set of labels
# so instead of the country, we can label our data through writing system, etc.
new_labels = []
# iterate through our pre-existing labels and use it to inform our new_labels
for label in labels:
    # select the new_label based on the old label (the country name)
    # 'writing_system' is a placeholder column name; use whichever column you are analyzing
    new_label = analysis_data.loc[analysis_data['country'] == label, 'writing_system'].iloc[0]
    new_labels.append(new_label)
# use the new_labels to colour a scatterplot with our tsne_results
tsne_df = pd.DataFrame({'tsne_1': tx, 'tsne_2': ty, 'label': new_labels})
sns.scatterplot(x='tsne_1', y='tsne_2', data=tsne_df, hue='label')

NOTE: The technology argument used a more qualitative method and is not included here.
We can see the results of those comparisons in the sections above.
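Since the external datasets didn't cover every country, it can help to check for gaps before plotting. Here's a small sketch of the relabeling step that also reports which countries still need a label. The `relabel` function and the toy lookup values are made up for illustration; the real lookup would come from the datasets mentioned above.

```python
def relabel(labels, lookup, default="unknown"):
    """Map each country label through `lookup`, collecting any that are missing."""
    new_labels, missing = [], set()
    for label in labels:
        if label in lookup:
            new_labels.append(lookup[label])
        else:
            # keep the lists parallel, but remember what needs a manual search
            new_labels.append(default)
            missing.add(label)
    return new_labels, sorted(missing)


# toy lookup for illustration only
writing_systems = {"Japan": "Japanese", "France": "Latin", "Russia": "Cyrillic"}
new_labels, missing = relabel(["Japan", "France", "Brazil", "Japan"], writing_systems)
print(new_labels)
print(missing)
```

Anything that shows up in `missing` is a country to go look up by hand before colouring the scatterplot.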

an answer in progress project
I read this piece when it came out in 2022. Maybe it should be marked with "(2022)". Previous discussion https://news.ycombinator.com/item?id=33745146
I just want to add that in addition to peculiar web design, Japanese websites have a way of assuming architectures or usage patterns where servers need to sleep or do some kind of scheduled job, which is really weird for people used to sites that need to account for a range of timezones or 24/7 availability (unless there is a pre-announced downtime that exists as a one-off thing). I know at least three websites off the top of my head that go down for "maintenance" at an exact scheduled time for hours every day, assuming that users would never want to access them overseas during those times (actually, one of those three doesn't even announce the reason, it just returns "server failed to respond" errors until it's time to "open up" for business again). Many services work fine, but at least a quarter to a half of Japanese web services are awful even though they eventually work if you can strangle yourself into making it work. The floor for Japanese web services is way below the floor for American ones. Those sites can get really mindnumbingly bad both on the front end and back end. I'm not sure what the cause is, but it must be a variety of factors. If tech-savvy users can't even make it work, I feel really bad for the struggling elders forced to use those sites.
I forget if it was Samsung or Sony, but somewhere along the way on my internet journey, someone claimed, without evidence, and thus I have none either, that the incentive structure for having prestige jobs at large technology companies was always in hardware design and software was seen as easier and more low class.
So since nobody will get any promotions for running good software, they are not incentivized to run good software, and therefore they do just enough to get by?
This is historically the reason software engineering in Japan has lagged and there's such a talent shortage (leading companies like mine to hire mostly foreign software engineers). I've heard it's changing, but it'll take a long time to catch up.
When I was working for Microsoft China, many of our foreign engineers were Korean and Japanese, who were in China for the higher paychecks.
Yes this is true and it might possibly be true for the rest of East Asia though I'm not sure. Software is considered intangible and thus low value that anyone can do, whereas hardware is a real "thing" that you can hold in your hands, and is therefore more prestigious. Well, this way of thinking has made things into the current state.
This was and partly is the attitude you can find in German non-software businesses where software is gaining more and more influence. For example, car manufacturing.
I found this out when buying a Japan Rail Pass for a trip a few years ago, blew my mind.
https://www.japanrailpass-reservation.net/ only works 4:00–23:30 Japan time.
This is especially funny since the JR Pass cannot be purchased by residents of Japan.
Yeah this is probably downstream of the fact that if you visit any of the individual JR sites from the expandable map at the bottom, you'll discover they're all down at this time as well. Let's scrap the website and make a staffed phone line or fax machine with operating hours.
Considering the state of japanese IT, there is probably a person typing each reservation from the website into a 1980s mainframe.
After receiving the orders that were actually printed from an Internet Explorer 6 only website, and faxed over from another office before being re-scanned in along with a barcode that usually failed to make it over the fax, hence the need to hand-type things. True story (not for JR specifically, but circa 2013)
I've also had issues topping up my (virtual) Suica card late at night before.
Maybe that's when they run all those crazy legacy jobs, but they politely shut the site down for it.
Anyone who has attempted to play Final Fantasy XIV beyond the free trial has experienced this. Their subscription management web app is so incredibly bad it takes a significant amount of time and effort just to purchase a subscription. I wonder how much revenue they lose simply from people giving up.
I was bored and tried playing FF14 about a year ago. You need to do the usual download a launcher to download the game, fine. It asks you to log in before it'll download, fine. It crashes ~10% of the way through downloading the game. Not great but you can make it by restarting the launcher and trying again. And again and again, about a dozen times. It does eventually finish though, and I did almost successfully make a character. Except after making my character you have to choose a server instance - and every single instance in the NA server I could find was "full". I don't know if it was actually full or erroring but I gave up at that point.
The buttonology is cryptic. Like you tasked enterprise Java devs with writing a frontend in jQuery.
At least that's how I remember it. Game might be fun, but I'll never know.
So you didn’t even get to the final boss, purchasing a sub.
While I played it I always had this dirty feeling imagining what the backend code must look like. Sends chills down my spine.
I played on my Playstation when I played a few years back, fortunately it was a seamless process! As parent comment said though, subscription process was almost user hostile for some reason.
I was wondering why the process was so convoluted. I thought it was because I was doing it from my phone and they just had a poor mobile site. Well, apparently they have a poor desktop site that has poor mobile support!
Let me tell you, as bad as the FF14 subscription process is, it's nothing compared to what they had for FF11 back in the day. We have it good!
A lot of Japanese websites also have to be tremendously over provisioned because of how regimented the country is. A friend of mine worked infrastructure for a local newspaper, and every day at 6PM they'd send a push notification to all their subscribers and had to provision for that peak. When he asked if they could smooth out traffic, send the notification to some folks a minute before, or a minute after he was almost thrown out of the room. "Japan runs on time. Not a minute early, not a minute late. On time".
The UK driving licence authority (DVLA) also has a period in which you can’t conduct a range of transactions overnight, but that’s because it interfaces with systems that still run batch jobs overnight and the cost of making it all 24/7 simply wasn’t worth it considering the demand.
Really, having common maintenance windows makes things way easier. If you already have a service with a limited geographical range, it's not bad.
A pet peeve of mine — undated blogs :(
The US Social Security Administration website is available from 6am to 8pm, Monday to Friday (or at least it was that way a few years ago)
The service hours seem a bit wider nowadays [0], but not 24/7.
[0] https://www.ssa.gov/myssa-static/rel_1.0/offHoursPopup.html
I’ve heard such things in the US were because of accessibility law that required the website (for the general population) to work no better than the associated call center (for the people who can’t interact with the website for whatever reason).
On one hand, that seems obviously stupid. On the other, I don’t see how you could phrase a legal requirement of this nature.
That's better than my assumption, which was that it was running off the Visual Foxpro instance on somebody's desktop and that guy had to be logged in for it to work.
this is also relatively common in Denmark, at least for government sites. One common thing you see (saw, haven't noticed in the last couple years) in Danish .gov sites is queuing where you need to wait some time before you are allowed in to use a site.
Getting ready for a trip to Japan, I spent an embarrassing amount of time troubleshooting failures to load a Suica (train/transit) NFC card on a phone before realizing it just doesn’t work a few hours a night Tokyo time.
The Suica app doesn't even work on my Pixel 10 Pro, since it requires an Android phone with some sort of Japan-specific hardware (FeliCa/Osaifu-Keitai technology, whatever that is, I'm assuming some special NFC or secure enclave sort of thing).
One of the worst sites in existence is the Japanese Visa site they direct people to to make QR codes for when you land in Japan as a tourist. It's atrocious.
https://services.digital.go.jp/en/visit-japan-web/
I hate it so much I kind of wish I could volunteer to fix it. I suspect the process though would be torture
Note: experience on mobile is bad. I don't remember if desktop is better.
I live in Japan and every time I go through the airport I refuse to use the QR code customs forms, the old paper based form is so much easier...
Probably the old habit of batch processing.
if you're talking about the train booking site going down -- struggling elders are still using the face to face or phone support. they probably have never made an online reservation.
I prefer the Japanese style. Information dense, yet clean. It reminds me of the web before Apple-style minimalism took over.
To contrast with a superficially similar style, Chinese web stores are also maximalist, but they tend to assault you with popup coupons, confetti effects, and other such things. Japanese style feels very efficient and utilitarian by comparison.
>"It reminds me of the web before Apple-style minimalism took over."
The loss of color and texture is my biggest gripe. So many webpages and user interfaces abandoned the idea of distinguishing components using different colors and just went with making the page as close to bleach white as possible. I suppose an upside of this is that it made dark-mode easier to adopt. That being said, good dark mode support seems relatively recent.
And now all AI slop coded by anyone is that. Telltale signs: AI likes to make cards, implement SVGs by hand, all cards have a left highlight border, off-center font spacing, badges and notification icons, etc.
I think you made a good observation about what’s in essence different between the Chinese style and the Japanese style. The popup coupons and confetti effects are all animations. Personally I find these animations highly distracting. Whereas if something is information dense but static, I like it.
(There are also non-store Chinese designs; they are not trying to sell anything so they don’t need coupons and confettis. These are actually enjoyable to use. And they are more information dense than the English equivalent because the Chinese script packs more in a smaller space. This of course makes such designs i18n-hostile.)
It reminds me of the “portal” era of Netscape, Excite and Yahoo. Very information dense. Among others’, Google’s minimalism took over.
There are still a few information dense English language sites out there, but they’re rarer. Honorable mentions:
- https://based.cooking/ (or the more updated fork https://publicdomainrecipes.com/)
- HN :)
(These are primarily text and lack the occasional color pop of the Japanese style, but I still admire the density and efficiency.)
I felt like part of Google's success was that the simple search bar loaded fast in an era where I often had slow internet. Yahoo's portal page had too much on it to distract or slow me down from doing what I came there to do.
Later on I remember finding out Yahoo had a search.yahoo.com page or something that was also just a search bar but that was harder to type so was still a failure of design.
This was before combined search and address bar.
It would not surprise me that Yahoo Japan was the blueprint for many of these sites. It still is extremely popular as a portal destination.
They feel like paper catalogues!
Yes, this was the portal style and I still adore it and use it myself, where I can. As long as the page has a scannable information hierarchy, information dense sites are better when you just want to get stuff done (/look stuff up), which for me is most of the time. I don't care about the fluff and "hero images" and the rest.
> Apple-style minimalism took over.
To be fair, it was Microsoft-style minimalism that Jony Ive brought to Apple, who then popularized it.
Do you actually use Japanese websites frequently? Because I do live in Japan, and I hate their websites with a passion. Go use any Japanese online shop; the purchase flows are usually absurdly convoluted, and they are so information dense that sometimes you don't know what you are actually going to purchase. It is one of the reasons I rarely use Rakuten anymore...
Yeah, I hate to say it, but using Amazon.co.jp is SO refreshing after using a Japanese website. It's really unbelievable how bad most Japanese e-commerce sites are.
The technology argument is the most convincing one to me. I worked with a Japanese client a few years ago and the internal tools they used were wild by western standards. Like full-on frameset layouts in 2020. But it wasn't ignorance, it was continuity. The tools worked, people knew how to use them, and there was zero appetite for redesigning something that wasn't broken.
The font thing is also underrated as a factor. When you only have a handful of web-safe CJK fonts and you can't rely on weight/size variations to create hierarchy the way you can with Latin text, you compensate with color and density. It's a constraint that pushes you toward a specific aesthetic whether you want it or not.
I think the framing of "peculiar" is a bit western-centric though. Dense information-heavy pages are arguably more respectful of the user's time than the trend of spreading three sentences across five viewport-heights of whitespace.