The man who disrupted Macedonian online media using clustering algorithms
The lack of a local Yahoo/Google News as Trajkovski’s opportunity to transform the Macedonian web
Although Macedonia is sometimes known as the country that produced fake news during the 2016 US presidential election, it also has a positive online news phenomenon that fascinated me during my teenage years. The most popular local website [1] is TIME.MK, a news aggregator that brings together credible online newspapers in a transparent and user-friendly way I hadn’t seen before, especially since there’s no local Google/Yahoo News. Citizens love it, and many have tried to build clones of it.
TIME.MK was developed by Igor Trajkovski. In this article, I talk to him [2] to find out how it was launched and to learn about his background, inspirations and values.
Filip: Before talking about your background and your projects, I’d like to ask: What is TIME.MK? If we were stuck in a 2-minute elevator conversation, how would you describe it?
Igor: TIME.mk is a news aggregation platform that collects articles from the biggest news websites in North Macedonia and, using techniques from Natural Language Processing and Computational Journalism, computes the top stories of the day. It first finds the stories in the collected articles using document clustering, then ranks them according to general criteria and proprietary algorithms.
The general criteria are: bigger clusters are preferred, newer articles are preferred, articles from authoritative sources are preferred, and stories covered by a more diverse set of sources are preferred. The algorithm behind TIME.mk is an amalgam of these four principles and many other heuristics learned over 15 years of running the news aggregator.
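To make the four criteria concrete, here is a minimal sketch of my own. It is not TIME.mk's ranking code, which is proprietary. The `Article` fields, the `story_score` function, the authority weights, the six-hour half-life and the equal weighting of the four terms are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    source: str       # hypothetical outlet name
    age_hours: float  # hours since publication


def story_score(cluster, authority, half_life_hours=6.0):
    """Score a non-empty cluster of articles; higher means more prominent.

    All four terms and their equal weighting are illustrative assumptions.
    """
    sources = {a.source for a in cluster}
    # 1. Bigger clusters are preferred (log-damped so size does not dominate).
    size = math.log1p(len(cluster))
    # 2. Newer articles are preferred: exponential decay of the freshest article.
    freshness = max(0.5 ** (a.age_hours / half_life_hours) for a in cluster)
    # 3. Authoritative sources are preferred: mean weight of the covering outlets.
    auth = sum(authority.get(s, 0.1) for s in sources) / len(sources)
    # 4. Stories covered by more distinct sources are preferred.
    diversity = math.log1p(len(sources))
    return size + freshness + auth + diversity


# Hypothetical authority weights and two toy stories.
authority = {"outlet-a": 1.0, "outlet-b": 0.8, "blog-c": 0.3}
big = [Article("outlet-a", 1.0), Article("outlet-b", 2.0), Article("blog-c", 0.5)]
small = [Article("blog-c", 20.0)]

ranked = sorted([small, big], key=lambda c: story_score(c, authority), reverse=True)
```

In this toy setup the fresh story covered by three outlets ranks above the single old blog post. A real system would tune the weights and add the many heuristics Igor mentions.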
Filip: Putting recent news first and ordering by cluster size makes sense. Without leaking the whole pipeline or secret sauce behind TIME.MK, can you share an example of an algorithmic technique or a class of techniques you've been using?
Igor: For clustering, we employ a highly optimized and tailored version of the Hierarchical Agglomerative Clustering (HAC) algorithm, with adjusted stopping criteria for merging. To measure the distance between clusters, we use Word Mover's Distance, which leverages word embeddings trained on an extensive corpus of 20 million news articles from 2001. In their basic forms, both algorithms are quite slow, which makes it challenging to complete the entire cycle of crawling, clustering, and ranking 10k news articles in under a minute. However, intelligent caching and optimization of the tasks that recur in each cycle make this speed achievable.
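For readers who want to see the shape of this approach, here is a small sketch of my own, not TIME.mk's implementation. It runs plain average-linkage HAC that stops merging once the closest pair of clusters is farther apart than a threshold. Instead of full Word Mover's Distance, it uses a cheaper relaxed lower bound in which every word simply travels to its nearest word in the other document. The two-dimensional embeddings in `EMB`, the threshold and the function names are invented for illustration; real embeddings would be trained on a news corpus.

```python
import math
from collections import Counter
from itertools import combinations

# Toy 2-d "embeddings", made up for this example only.
EMB = {
    "government": (1.0, 0.0), "cabinet": (0.9, 0.1),
    "budget": (0.8, 0.3), "finances": (0.7, 0.35),
    "match": (0.0, 1.0), "game": (0.1, 0.9),
    "striker": (0.2, 1.0), "forward": (0.25, 0.95),
}


def relaxed_wmd(a, b):
    """Relaxed lower bound on Word Mover's Distance between two token lists."""
    def one_way(src, dst):
        # Each source word moves all of its mass to the nearest target word.
        counts = Counter(src)
        return sum(
            (n / len(src)) * min(math.dist(EMB[w], EMB[v]) for v in dst)
            for w, n in counts.items()
        )
    # The larger of the two one-way relaxations is the tighter bound.
    return max(one_way(a, b), one_way(b, a))


def hac(docs, distance, threshold):
    """Average-linkage HAC; stops when the closest pair exceeds `threshold`."""
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        total = sum(distance(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break  # stopping criterion: nothing left that is close enough
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


docs = [
    ["government", "budget"],
    ["cabinet", "finances"],
    ["match", "striker"],
    ["game", "forward"],
]
clusters = hac(docs, relaxed_wmd, threshold=0.5)
```

The first two documents share no words, and neither do the last two, yet each pair ends up in the same cluster because their words sit close together in the embedding space. This naive version recomputes every pairwise distance on each pass, which is exactly the slowness Igor says has to be engineered away with caching.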
Filip: 15 years is a long time. What was the inspiration for building it, and how did it start? I know it had something to do with your personal decision to move outside Macedonia.
Igor: Yes. Living abroad from 2001 to 2008, I was interested in Macedonian affairs. I regularly browsed various news portals like A1, Dnevnik, Vecer, etc. I noticed that different sources highlighted different top stories and some even censored certain events or topics. To get a comprehensive view of major events, one had to visit all the major news outlets. Google News, which emerged post-9/11, addressed this issue but lacked a Macedonian edition, a common scenario even for larger countries at the time.
This gap sparked my initial inspiration. I was curious whether the news aggregation concept would be effective in a smaller country like Macedonia and whether text processing algorithms could adapt to the Macedonian language. To be honest, I was also looking for a source of passive income and for a real-world application of IT in general and AI in particular, while spending my abundant free time reading AI papers and news, philosophy, psychology and history.
Filip: What were the biggest challenges during these 15 years?
Igor: TIME.mk was launched in July 2008. Initially, during the first three years, it wasn't the most visited website. Its concept of news aggregation seemed to resonate primarily with a more educated audience. Many, even to this day, mistakenly perceive TIME.mk as a news source rather than a tool for accessing news. This misconception even led some news editors to believe we were plagiarizing their content, raising concerns about intellectual property infringement, though legal action was never pursued.
A pivotal moment in TIME.mk's history occurred in 2011, an event I liken to the mammalian rise 65 million years ago. This was when the major news source and largest TV station, a1.com.mk, was shut down. This led to the emergence of numerous opposition news websites, formed by former A1 journalists. However, none could match A1's comprehensive coverage. During Macedonia's particularly polarized political climate, TIME.mk stood out as the only platform offering real-time coverage of events from diverse political perspectives, attracting visitors across the political spectrum.
Since then, TIME.mk has been the most visited website in Macedonia. We've expanded our services, adding features like TopTweets analysis from the Macedonian Twitter sphere, aggregating TV shows and political debates available on YouTube, and developing efficient search engines for public procurements. These enhancements have further strengthened our position in the news market.
We encountered two significant technical challenges. The first was algorithmic: clustering similar news stories even when sources use different vocabulary for the same topics. Modern NLP tools, particularly word embeddings, were instrumental in overcoming this challenge. The second challenge related to infrastructure, specifically DDoS (Distributed Denial of Service) attacks.
Being a major news source, we attracted various attackers, both local and international, motivated by monetary or political gain, aiming to disrupt our service. Developing robust defenses against these attacks posed a considerable challenge.
Filip: You mention that you lived abroad from 2001 to 2008. Reading your CV, you interned at Google and did research at Max Planck Institute for Computer Science in Germany and Jožef Stefan Institute in Slovenia. I’d like to know a few things:
1. The spicy one… What’s your opinion on publish-or-perish, p-hacking, grant and tenure negotiations in academia?
2. Do you have a favorite professor, course or a story during these years?
3. How does academia compare to the work at Google, and are R&D labs (e.g. Microsoft Research or what used to be Google Brain) a viable alternative too? I’ve heard that some people choose the academic path instead of engineering because they can’t stand the deadlines and meetings :)
Igor:
1. I despise it. That 'publish-or-perish' mentality was one of the major reasons why I left my professorship, along with other issues related to how FINKI was managed. FINKI is the largest state Faculty for Computer Science in Macedonia. I never imagined that a research job could feel like working on a factory line, where you're expected to produce a certain number of papers within a set timeframe. This inevitably leads to unethical practices like p-hacking, plagiarism, self-plagiarism or recycling, and citation manipulation, all of which erode personal integrity. I'm ambivalent about the tenure system. It might be beneficial in social sciences, where it can protect professors' freedom to critique society without fear of losing their jobs. However, it can also be exploited by less productive professors who merely enjoy the status without fulfilling their responsibilities. This was a noticeable pattern at FINKI.
2. My favorite professor during my undergraduate studies was Prof. Dr. Oliver Popov. He taught an introductory course in computer science, focusing on computability and the philosophy related to the field. It's challenging to articulate the profound impact he had on us, instilling lifelong motivation and presenting computation as a reality-altering force. My favorite courses were 'Algorithms, Part I and II' from Princeton University, which I believe are fundamental to everything a computer scientist or programmer will ever build.
3. I've never held a research position in the industry, so I can't make a direct comparison. However, the research I conducted for my PhD was enjoyable. Although it required producing peer-reviewed journal papers, the timeframe was reasonable, and there wasn't excessive pressure or too many meetings. My industry experiences, including roles as a developer, left me less enthusiastic about agile development due to its frequent meetings and reports. Working in two Berlin startups, I found the division of responsibilities strange. In such environments, you're often expected to wear multiple hats: developer, tester, owner, dev-ops, etc. It was overwhelming to handle tasks beyond my expertise, which is solving algorithmic problems, while also being responsible for implementing, testing, monitoring in production, and updating software as the environment evolved. There was an excess of tedious, uninteresting work.
Filip: So if I understand you correctly, being aware of your own competitive advantages was the key when it came to experimenting between academia, engineering and starting a business.
You mention philosophy and computability, so that’s what I’m aiming at for the final questions.
Who’s your favorite philosopher, what’s your favorite philosophy-related book, and do you have any thoughts on determinism, free will and gene editing?
Igor:
I'm particularly drawn to 18th and 19th-century German philosophers from the Enlightenment era. These thinkers were at the forefront of constructing meaning without relying on the concept of God. This challenge resonates with me as we, as humans, keep wrestling with questions about our place in the universe and our origins. Our search for meaning likely stems from a desire for control and predictability, necessitating a story or framework to make sense of our existence.
Among these German philosophers, Schopenhauer and Hegel stand out to me. Among more contemporary thinkers, I highly recommend 'The Denial of Death' by the anthropologist Ernest Becker. Becker's core argument in this book is that the primary psychological function of culture is to offer humans ways to deny the inevitability and fear of death. He suggests that a significant portion of human behavior is driven by an unconscious fear of death, with cultural institutions and beliefs providing meaning, order, and a sense of permanence to our lives.
When discussing determinism and free will, it's crucial to first define these concepts in the specific context. I lean towards determinism, and thus, the perspective that free will is an illusion. Our representational and prediction systems are not capable of foreseeing the future, leading to this misconception. If the universe operates deterministically (as Einstein suggested, 'God does not play dice'), then free will seems implausible since everything is determined by previous states of reality. Even in a non-deterministic universe, the origin of free will remains unclear to me.
Regarding gene editing, I view it as an inevitable next step in the evolution of medicine, building upon past advancements like antibiotics, vaccines, prosthetics, stents, bypass surgeries, and pacemakers.
Filip: And for the final one. There’s a $10M prize for building an AI that wins an International Math Olympiad and there’s tons of buzz regarding ChatGPT and LLMs.
Also there are claims that AI-powered robots will beat pro soccer players soon.
If you were to make a rough guess (you may use uncertainty and probabilities as percentages), how would you answer these:
1. How soon will the AI IMO prize be claimed?
2. When will AI-powered robots beat pro soccer players?
3. How likely is it that LLM/transformer-based solutions will be the first ones to solve a major human problem (including, but not limited to finding a cancer cure or solving unsolved math problems), or even worse, cause one (unaligned AGI that wants to hurt humanity)?
Igor:
1. I predict that within 3-5 years, we'll see AI entities, not humans, winning gold medals at the International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI).
2. Regarding AI-powered robots in professional soccer, my forecast is that in less than a decade, a team of non-human players will be capable of defeating the world champions in soccer.
3. As for the potential of LLM/transformer-based solutions in solving or exacerbating major human issues, including but not limited to cancer or unsolved mathematical problems, there's significant potential. Current LLM systems, including finetuning and RLHF, could indeed aid in finding solutions. However, I have a gut feeling that they're primarily interpolating within the confines of their training datasets, lacking the necessary guidance for meaningful extrapolation beyond this data. Genuine creativity requires the ability to extrapolate and explore, along with a set of values, whether explicit (like survival instincts) or implicit (such as ethical or moral values), to drive progress. Present-day LLMs lack both of these capabilities.
I’d like to thank Igor for taking the time to answer my questions, and I hope that TIME.MK will continue its good work.
[1] Proven for many periods after 2011 by Alexa (now defunct) and Google Analytics data.
[2] I personally met him during a summer internship at TIME.MK when I worked on word2vec experiments for the Macedonian language, so this interview is not completely unbiased 🙂